ABSTRACT
Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that the techniques originally developed for speaker verification and identification are not sufficiently discriminative for speaker clustering. However, the processing chain for speaker clustering is long, so there are many potential areas for improvement. The question is: which stage should be changed to improve the final result? To answer this question, this paper takes a biomimetic approach based on a study in which human participants act as an automatic speaker clustering system. Our findings are twofold: the modeling stage has the highest potential for improvement, and information about the temporal succession of frames is crucially missing. Experimental results on TIMIT data, obtained with our implementation of a speaker clustering system that incorporates these findings, show the validity of our approach.
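The remark about missing temporal information can be made concrete with a small sketch. It assumes, purely for illustration, a model that summarizes an utterance by per-dimension mean and variance of its feature frames; this is a stand-in chosen here, not the system described in the paper. The toy data and the helper names `frame_statistics` and `same_statistics` are hypothetical. The sketch shows that such frame-wise statistics stay the same when the frames are shuffled, so the model cannot see the order in which frames occur.

```python
import math
import random


def frame_statistics(frames):
    """Return per-dimension (means, variances) over a list of feature frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dims)
    ]
    return means, variances


def same_statistics(a, b, tol=1e-9):
    """Compare two (means, variances) pairs up to floating-point tolerance."""
    return all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol)
        for part_a, part_b in zip(a, b)
        for x, y in zip(part_a, part_b)
    )


# Hypothetical toy "utterance": 200 frames of 3-dimensional features whose
# values drift over time, so the frame order carries real information.
rng = random.Random(0)
utterance = [[t / 200 + rng.gauss(0, 0.1) for _ in range(3)] for t in range(200)]

# Destroy the temporal succession while keeping the same set of frames.
shuffled = utterance[:]
rng.shuffle(shuffled)

order_changed = shuffled != utterance
stats_match = same_statistics(
    frame_statistics(utterance), frame_statistics(shuffled)
)
print("order changed:", order_changed)
print("statistics unchanged:", stats_match)
```

Any model built only from such order-independent statistics treats the original and the shuffled utterance as the same input, which is one way to read the finding that frame succession is not being used.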
Index Terms
- Unfolding speaker clustering potential: a biomimetic approach