research-article

Unfolding speaker clustering potential: a biomimetic approach

Authors:
Thilo Stadelmann, University of Marburg, Marburg, Germany
Bernd Freisleben, University of Marburg, Marburg, Germany

MM '09: Proceedings of the 17th ACM international conference on Multimedia, October 2009, Pages 185–194
https://doi.org/10.1145/1631272.1631300

Published: 19 October 2009

ABSTRACT

Speaker clustering is the task of grouping a set of speech utterances into speaker-specific classes. The basic techniques for solving this task are similar to those used for speaker verification and identification. The hypothesis of this paper is that the techniques originally developed for speaker verification and identification are not sufficiently discriminative for speaker clustering. However, the processing chain for speaker clustering is long, so there are many potential areas for improvement. The question is: which stage should be improved to benefit the final result most? To answer this question, this paper takes a biomimetic approach based on a study with human participants acting as an automatic speaker clustering system. Our findings are twofold: the modeling stage has the highest potential, and information about the temporal succession of frames is crucially missing. Experimental results with our implementation of a speaker clustering system that incorporates these findings, applied to TIMIT data, show the validity of our approach.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
2. Hardware
  1. Communication hardware, interfaces and storage
    1. Signal processing systems

Published in
MM '09: Proceedings of the 17th ACM international conference on Multimedia
October 2009
1202 pages
ISBN: 9781605586083
DOI: 10.1145/1631272
General Chairs: Wen Gao (Peking University, China), Yong Rui (Microsoft, China), Alan Hanjalic (Delft University of Technology, The Netherlands)
Program Chairs: Changsheng Xu (Institute of Automation, Chinese Academy of Sciences, China), Eckehard Steinbach (Technical University of Munich, Germany), Abdulmotaleb El Saddik (University of Ottawa, Canada), Michelle Zhou (IBM T. J. Watson Research Center, USA)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GMM
MFCC
one-class SVM
speaker clustering
speaker diarization
speaker identification
temporal context
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%