Published in: Machine Vision and Applications 1/2014

01.01.2014 | Special Issue Paper

Discovering joint audio–visual codewords for video event detection

Authors: I-Hong Jhuo, Guangnan Ye, Shenghua Gao, Dong Liu, Yu-Gang Jiang, D. T. Lee, Shih-Fu Chang


Abstract

Detecting complex events in videos is intrinsically a multimodal problem, since both the audio and visual channels provide important clues. While conventional methods fuse both modalities at a superficial level, in this paper we propose a new representation, called bi-modal words, to explore representative joint audio–visual patterns. We first build a bipartite graph to model relations across the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to produce the bi-modal words that reveal the joint patterns across modalities. Different pooling strategies are then employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations. Since it is difficult to predict the suitable number of bi-modal words, we generate bi-modal words at different levels (i.e., codebooks with different sizes) and use multiple kernel learning to combine the resulting multiple representations during event classifier learning. Experimental results on three popular datasets show that the proposed method achieves statistically significant performance gains over methods using individual visual or audio features alone, as well as over existing popular multi-modal fusion methods. We also find that average pooling is particularly suitable for the bi-modal representation, and that using multiple kernel learning to combine multi-modal representations at various granularities is helpful.
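As a rough illustration of the pipeline outlined in the abstract, the Python snippet below is a minimal sketch, not the authors' implementation: it builds a co-occurrence-based bipartite graph between visual and audio codewords, partitions it with spectral co-clustering to obtain bi-modal words, and average-pools each video's histograms into a bi-modal Bag-of-Words vector. The toy data, the choice of scikit-learn's SpectralCoclustering as the partitioning step, and the helper bimodal_bow are assumptions made for illustration only.

    # Sketch of bi-modal codeword construction (illustrative assumptions throughout).
    import numpy as np
    from sklearn.cluster import SpectralCoclustering

    rng = np.random.default_rng(0)

    # Toy per-video Bag-of-Words histograms: visual and audio codebooks.
    n_videos, n_visual, n_audio = 200, 50, 30
    visual_bow = rng.poisson(1.0, size=(n_videos, n_visual)).astype(float)
    audio_bow = rng.poisson(1.0, size=(n_videos, n_audio)).astype(float)

    # Edge weights of the bipartite graph: co-occurrence of visual word i
    # and audio word j across the video collection.
    cooccurrence = visual_bow.T @ audio_bow  # shape (n_visual, n_audio)

    # Partition the bipartite graph; each bicluster acts as one "bi-modal word".
    n_bimodal = 10  # the paper generates codebooks of several sizes
    model = SpectralCoclustering(n_clusters=n_bimodal, random_state=0)
    model.fit(cooccurrence + 1e-6)   # small offset avoids all-zero rows/columns
    visual_assign = model.row_labels_     # visual word -> bi-modal word
    audio_assign = model.column_labels_   # audio word  -> bi-modal word

    def bimodal_bow(v_hist, a_hist, pooling="average"):
        """Re-quantize one video's visual/audio histograms into a bi-modal BoW."""
        feat = np.zeros(n_bimodal)
        for k in range(n_bimodal):
            merged = np.concatenate([v_hist[visual_assign == k],
                                     a_hist[audio_assign == k]])
            if merged.size == 0:
                continue
            feat[k] = merged.mean() if pooling == "average" else merged.max()
        return feat

    X = np.stack([bimodal_bow(v, a) for v, a in zip(visual_bow, audio_bow)])
    print(X.shape)  # (200, 10): one bi-modal BoW vector per video

In the paper, bi-modal codebooks of several sizes are generated in this manner and combined with multiple kernel learning during event classifier training; the sketch above stops at producing a single representation.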


Footnotes
1
Normally, event detection is performed at the video level, i.e., detecting whether a video contains an event of interest. Therefore, we represent each video by a feature vector.
 
Metadata
Title
Discovering joint audio–visual codewords for video event detection
Authors
I-Hong Jhuo
Guangnan Ye
Shenghua Gao
Dong Liu
Yu-Gang Jiang
D. T. Lee
Shih-Fu Chang
Publication date
01.01.2014
Publisher
Springer Berlin Heidelberg
Published in
Machine Vision and Applications / Issue 1/2014
Print ISSN: 0932-8092
Electronic ISSN: 1432-1769
DOI
https://doi.org/10.1007/s00138-013-0567-0
