Published in: International Journal of Multimedia Information Retrieval 3/2014

1 September 2014 | Regular Paper

Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast

Authors: Hervé Bredin, Anindya Roy, Viet-Bac Le, Claude Barras


Abstract

This work introduces a unified framework for mono-, cross- and multi-modal person recognition in multimedia data. Dubbed person instance graph, the framework models the person recognition task as a graph mining problem, i.e., finding the best mapping between person instance vertices and identity vertices. We first describe how the approach can be applied in practice to speaker identification in TV broadcast. We then propose a solution to this mapping problem that relies on integer linear programming to cluster person instances according to their identity. We provide an in-depth theoretical definition of the optimization problem and improve two fundamental aspects of our previous related work: the problem constraints and the optimized objective function. Finally, we perform a thorough experimental evaluation of the proposed framework on a publicly available benchmark database. We show that, depending on the graph configuration (i.e., the choice of its vertices and edges), multiple tasks can be addressed interchangeably (e.g., speaker diarization, supervised or unsupervised speaker identification), significantly outperforming state-of-the-art mono-modal approaches.
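
The mapping problem described in the abstract can be illustrated on a toy graph. The sketch below is not the authors' formulation: the vertex names, the edge probabilities and the objective (reward an edge with probability p by p when its two vertices share a cluster, and by 1 - p otherwise) are all assumptions chosen for illustration. Exhaustive search also stands in for an integer linear programming solver, which is only feasible because the graph is tiny. Assigning one cluster label per vertex makes the "same cluster" relation transitive by construction, and giving each identity vertex its own fixed label keeps two identities from being merged.

```python
from itertools import product

# Hypothetical toy graph: two identity vertices and three person
# instance vertices (e.g. speech turns). None of this comes from the paper.
identities = ["Alice", "Bob"]
instances = ["turn1", "turn2", "turn3"]

# Assumed edge weights: probability that both endpoints are the same person.
edges = {
    ("turn1", "turn2"): 0.9,
    ("turn1", "turn3"): 0.2,
    ("turn2", "turn3"): 0.1,
    ("turn1", "Alice"): 0.8,
    ("turn3", "Bob"): 0.7,
}


def score(labels):
    """Assumed objective: p for edges kept inside a cluster, 1 - p for cut edges."""
    total = 0.0
    for (u, v), p in edges.items():
        total += p if labels[u] == labels[v] else 1.0 - p
    return total


def best_mapping():
    """Brute-force search over cluster labels (stand-in for an ILP solver)."""
    # Labels 0..len(identities)-1 are reserved for the identity vertices;
    # the remaining labels allow anonymous clusters with no identity.
    n_labels = len(identities) + len(instances)
    fixed = {name: k for k, name in enumerate(identities)}
    best_score, best_labels = float("-inf"), None
    for choice in product(range(n_labels), repeat=len(instances)):
        labels = dict(fixed)
        labels.update(zip(instances, choice))
        s = score(labels)
        if s > best_score:
            best_score, best_labels = s, labels
    # Translate cluster labels back into identity names (None = unidentified).
    mapping = {
        inst: identities[best_labels[inst]]
        if best_labels[inst] < len(identities)
        else None
        for inst in instances
    }
    return mapping, best_score


mapping, best = best_mapping()
print(mapping, round(best, 3))
```

In this toy setting, turn2 has no direct edge to any identity vertex, so any name it receives can only come through its strong edge to turn1. That is the kind of propagation a graph-based formulation allows. Removing the identity vertices from the graph would turn the same search into a pure clustering (diarization-like) problem.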

Metadata
Title
Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast
Authors
Hervé Bredin
Anindya Roy
Viet-Bac Le
Claude Barras
Publication date
1 September 2014
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 3/2014
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-014-0055-y
