DOI: 10.1145/1631272.1631300
research-article

Unfolding speaker clustering potential: a biomimetic approach

Published: 19 October 2009

ABSTRACT

Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that the techniques originally developed for speaker verification and identification are not sufficiently discriminative for speaker clustering. However, the processing chain for speaker clustering is long, so there are many potential areas for improvement. The question is: where should changes be made to improve the final result? To answer this question, this paper takes a biomimetic approach based on a study in which human participants act as an automatic speaker clustering system. Our findings are twofold: the modeling stage has the highest potential, and information about the temporal succession of frames is crucially missing. Experimental results from our implementation of a speaker clustering system that incorporates these findings, applied to TIMIT data, show the validity of our approach.
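
The second finding, that frame order is lost, can be illustrated with a toy sketch. It is not the system described in the paper: the one-dimensional "frames", the helper names `bag_of_frames_model` and `mean_abs_delta`, and the reordering are all assumptions made for illustration. The sketch shows that a model built only from per-frame statistics cannot tell a sequence from a reordered copy of it, while even a crude frame-to-frame cue can.

```python
import statistics

def bag_of_frames_model(frames):
    """Summarize frames by mean and variance only (hypothetical helper).

    Frame order plays no role in either statistic.
    """
    return (statistics.fmean(frames), statistics.pvariance(frames))

def mean_abs_delta(frames):
    """Average absolute change between consecutive frames (hypothetical helper).

    This is a crude cue that depends on the order of the frames.
    """
    deltas = [abs(b - a) for a, b in zip(frames, frames[1:])]
    return statistics.fmean(deltas)

# Toy "utterance": a smooth ramp of 1-D feature values (an assumption,
# real systems use multi-dimensional features such as MFCCs).
frames = [float(i) for i in range(20)]

# Same frames in a different order: even-indexed frames first, then odd.
reordered = frames[::2] + frames[1::2]

# The order-free model is identical for both sequences ...
print(bag_of_frames_model(frames) == bag_of_frames_model(reordered))

# ... while the order-sensitive cue separates them.
print(mean_abs_delta(frames), mean_abs_delta(reordered))
```

Whether and how such temporal information is best captured in a real speaker clustering system is the subject of the paper itself; the sketch only makes the notion of "missing temporal succession" concrete.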


Published in

MM '09: Proceedings of the 17th ACM international conference on Multimedia
October 2009
1202 pages
ISBN: 9781605586083
DOI: 10.1145/1631272

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)
