Published in: Machine Vision and Applications 1/2014

01.01.2014 | Special Issue Paper

Discovering joint audio–visual codewords for video event detection

Authors: I-Hong Jhuo, Guangnan Ye, Shenghua Gao, Dong Liu, Yu-Gang Jiang, D. T. Lee, Shih-Fu Chang


Abstract

Detecting complex events in videos is intrinsically a multimodal problem, since both the audio and visual channels provide important clues. While conventional methods fuse both modalities at a superficial level, in this paper we propose a new representation, called bi-modal words, to explore representative joint audio–visual patterns. We first build a bipartite graph to model relations across the quantized words extracted from the visual and audio modalities. Partitioning over the bipartite graph is then applied to produce the bi-modal words that reveal the joint patterns across modalities. Different pooling strategies are then employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations. Since it is difficult to predict the suitable number of bi-modal words, we generate bi-modal words at different levels (i.e., codebooks with different sizes) and use multiple kernel learning to combine the resulting multiple representations during event classifier learning. Experimental results on three popular datasets show that the proposed method achieves statistically significant performance gains over methods using individual visual or audio features alone, as well as over existing popular multi-modal fusion methods. We also find that average pooling is particularly suitable for the bi-modal representation, and that using multiple kernel learning to combine multi-modal representations at various granularities is helpful.
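As a rough illustration of the pipeline outlined in the abstract, the Python snippet below is a minimal sketch, not the authors' implementation: it builds a co-occurrence-based bipartite graph between visual and audio codewords, partitions it with spectral co-clustering to obtain bi-modal words, and average-pools each video's histograms into a bi-modal Bag-of-Words vector. The toy data, the choice of scikit-learn's SpectralCoclustering as the partitioning step, and the helper bimodal_bow are assumptions made for illustration only.

    # Sketch of bi-modal codeword construction (illustrative assumptions throughout).
    import numpy as np
    from sklearn.cluster import SpectralCoclustering

    rng = np.random.default_rng(0)

    # Toy per-video Bag-of-Words histograms: visual and audio codebooks.
    n_videos, n_visual, n_audio = 200, 50, 30
    visual_bow = rng.poisson(1.0, size=(n_videos, n_visual)).astype(float)
    audio_bow = rng.poisson(1.0, size=(n_videos, n_audio)).astype(float)

    # Edge weights of the bipartite graph: co-occurrence of visual word i
    # and audio word j across the video collection.
    cooccurrence = visual_bow.T @ audio_bow  # shape (n_visual, n_audio)

    # Partition the bipartite graph; each bicluster acts as one "bi-modal word".
    n_bimodal = 10  # the paper generates codebooks of several sizes
    model = SpectralCoclustering(n_clusters=n_bimodal, random_state=0)
    model.fit(cooccurrence + 1e-6)   # small offset avoids all-zero rows/columns
    visual_assign = model.row_labels_     # visual word -> bi-modal word
    audio_assign = model.column_labels_   # audio word  -> bi-modal word

    def bimodal_bow(v_hist, a_hist, pooling="average"):
        """Re-quantize one video's visual/audio histograms into a bi-modal BoW."""
        feat = np.zeros(n_bimodal)
        for k in range(n_bimodal):
            merged = np.concatenate([v_hist[visual_assign == k],
                                     a_hist[audio_assign == k]])
            if merged.size == 0:
                continue
            feat[k] = merged.mean() if pooling == "average" else merged.max()
        return feat

    X = np.stack([bimodal_bow(v, a) for v, a in zip(visual_bow, audio_bow)])
    print(X.shape)  # (200, 10): one bi-modal BoW vector per video

In the paper, bi-modal codebooks of several sizes are generated in this manner and combined with multiple kernel learning during event classifier training; the sketch above stops at producing a single representation.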


Footnotes
1
Normally, event detection is performed at the video level, i.e., detecting whether a video contains an event of interest. Therefore, we represent each video by a feature vector.
 
Metadata
Title
Discovering joint audio–visual codewords for video event detection
Authors
I-Hong Jhuo
Guangnan Ye
Shenghua Gao
Dong Liu
Yu-Gang Jiang
D. T. Lee
Shih-Fu Chang
Publication date
01.01.2014
Publisher
Springer Berlin Heidelberg
Published in
Machine Vision and Applications / Issue 1/2014
Print ISSN: 0932-8092
Electronic ISSN: 1432-1769
DOI
https://doi.org/10.1007/s00138-013-0567-0
