Published in: International Journal of Computer Vision

01.05.2016

Learning from Multiple Sources for Video Summarisation

Authors: Xiatian Zhu, Chen Change Loy, Shaogang Gong

Published in: International Journal of Computer Vision | Issue 3/2016

Abstract

Many visual surveillance tasks, e.g. video summarisation, are conventionally accomplished by analysing imagery-based features. Relying solely on visual cues for public surveillance video understanding is unreliable, since visual observations obtained from public space CCTV video data are often not sufficiently trustworthy and events of interest can be subtle. We believe that non-visual data sources such as weather reports and traffic sensory signals can be exploited to complement visual data for video content analysis and summarisation. In this paper, we present a novel unsupervised framework to learn jointly from both visual and independently-drawn non-visual data sources for discovering meaningful latent structure of surveillance video data. In particular, we investigate ways to cope with discrepant dimension and representation whilst associating these heterogeneous data sources, and derive an effective mechanism to tolerate missing and incomplete data from different sources. We show that the proposed multi-source learning framework not only achieves better video content clustering than state-of-the-art methods, but is also capable of accurately inferring missing non-visual semantics from previously-unseen videos. In addition, a comprehensive user study is conducted to validate the quality of video summarisation generated using the proposed multi-source model.

Spatio-temporal combinations of human activity or interaction patterns, e.g. gathering, or environmental state changes, e.g. raining.

Also known as the heteroscedasticity problem (Duin and Loog 2004).

Missing-data filling algorithms exist for conventional random forests: e.g. for a missing value of one feature in one class, the median value (continuous) or the most frequent category (discrete) of this feature over the current class can be used as the estimate (Breiman 2003). Whilst a similar strategy could be applied to our MSC-Forest, we instead propose an effective adaptive weighting algorithm, so as not to introduce further noisy training data.
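The conventional fill strategy mentioned above (class-wise median for continuous features, class-wise most frequent category for discrete ones) can be sketched as follows. This is a minimal illustration only: the function name `impute_by_class`, the use of `None` as the missing-value marker, and the list-of-lists data layout are assumptions made for the example. It is not the adaptive weighting algorithm used in MSC-Forest.

```python
from collections import Counter
from statistics import median


def impute_by_class(rows, labels, column, categorical=False):
    """Fill None entries of one feature using values from the same class.

    rows        : list of samples, each a list of feature values (None = missing)
    labels      : class label of each sample
    column      : index of the feature to fill
    categorical : True -> most frequent category, False -> median
    """
    # Collect the observed (non-missing) values of this feature per class.
    observed = {}
    for row, label in zip(rows, labels):
        if row[column] is not None:
            observed.setdefault(label, []).append(row[column])

    # Work on copies so the caller's data is left untouched.
    filled = [list(row) for row in rows]
    for row, label in zip(filled, labels):
        values = observed.get(label)
        if row[column] is None and values:
            if categorical:
                # Most frequent category within the class.
                row[column] = Counter(values).most_common(1)[0][0]
            else:
                # Median of the continuous feature within the class.
                row[column] = median(values)
    return filled


# Hypothetical toy data: [traffic speed, weather tag], all in one class.
samples = [[1.0, "rain"], [3.0, "rain"], [None, "sunny"], [5.0, None]]
classes = ["a", "a", "a", "a"]
speed_filled = impute_by_class(samples, classes, column=0)
weather_filled = impute_by_class(samples, classes, column=1, categorical=True)
```

A value stays missing when its class has no observed value for that feature. That is a choice made for this sketch; the source does not say how such cases are handled.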

It is worth noting that the purpose of this clustering step is completely different from that of the multi-source data clustering during model training, as presented in Sect. 3.3. The latter is a component of our multi-source model training pipeline (Fig. 2), whilst the former aims at revealing the latent structure of testing data for video summarisation.

Datasets available: www.eecs.qmul.ac.uk/%7Exz303/download.html.

No vehicle detection on the ERCe dataset.

Evaluating a forest that takes only non-visual inputs is not possible, since non-visual data is not available for previously-unseen video footage.

VNV-MSC-Forest-hard shares the same clusters as VNV-MSC-Forest.

The event of interest is analogous to important objects/regions in (Lee et al. 2012).

The inferred non-visual tags include weather, traffic conditions, and typicality. The typicality tag of each clip, i.e. usual or interesting, is computed from the size of its assigned cluster (Fig. 4c). Clips assigned to the smallest \(20\,\%\) of clusters are treated as ‘interesting’.
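The typicality rule described above can be illustrated with a short sketch. The function name `typicality_tags`, rounding the number of selected clusters up with `math.ceil`, and breaking ties between equally sized clusters by cluster id are assumptions made for the example; the text only states that clips in the smallest 20 % of clusters are tagged as interesting.

```python
import math
from collections import Counter


def typicality_tags(cluster_ids, fraction=0.2):
    """Tag each clip 'interesting' or 'usual' from the size of its cluster.

    cluster_ids : cluster assignment of each clip (e.g. integers)
    fraction    : share of clusters, counted from the smallest, to flag
    """
    # Number of clips assigned to each cluster.
    sizes = Counter(cluster_ids)
    # Rank clusters from smallest to largest (ties broken by cluster id).
    ranked = sorted(sizes, key=lambda c: (sizes[c], c))
    # Assumption: round up so that at least one cluster is selected.
    n_small = math.ceil(fraction * len(ranked))
    interesting = set(ranked[:n_small])
    return ["interesting" if c in interesting else "usual" for c in cluster_ids]


# Hypothetical assignment: five clusters with 10, 8, 6, 5 and 1 clips.
assignments = [0] * 10 + [1] * 8 + [2] * 6 + [3] * 5 + [4] * 1
tags = typicality_tags(assignments)
```

With five clusters and a fraction of 0.2, only the single smallest cluster is selected, so only the one clip in cluster 4 receives the interesting tag.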

Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., & Hwang, D.U. (2006). Complex networks: Structure and dynamics. Physics reports (pp. 175–308).

Bosch, A., Zisserman, A., & Munoz, X. (2007). Image classification using random forests and ferns. In IEEE international conference on computer vision.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5.

Breiman, L. (2003). Rf/tools: A class of two-eyed algorithms. In SIAM Workshop, Statistics Department, UC Berkeley.

Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and regression trees. New York: Chapman & Hall/CRC.

Cai, X., Nie, F., Huang, H., & Kamangar, F. (2011). Heterogeneous image feature integration via multi-modal spectral clustering. In IEEE conference on computer vision and pattern recognition.

Caruana, R., Karampatziakis, N., & Yessenalina, A. (2008). An empirical evaluation of supervised learning in high dimensions. In International conference on machine learning.

Chan, A. B., & Vasconcelos, N. (2008). Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 909–926.

Chu, W. S., Song, Y., & Jaimes, A. (2015). Video co-summarization: Video summarization by visual co-occurrence. In IEEE conference on computer vision and pattern recognition (pp. 3584–3592).

Cong, Y., Yuan, J., & Luo, J. (2012). Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia, 14(1), 66–75.

Criminisi, A., & Shotton, J. (2012). Decision forests: A unified framework. Foundations and trends in computer graphics and vision (pp. 81–227).

Duin, R., & Loog, M. (2004). Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 732–739.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1627–1645.

Feng, S., Lei, Z., Yi, D., & Li, S. Z. (2012). Online content-aware video condensation. In IEEE conference on computer vision and pattern recognition.

Fu, Y., Hospedales, T., Xiang, T., & Gong, S. (2013). Learning multi-modal latent attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 303–316.

Gall, J., Yao, A., Razavi, N., Gool, L. J. V., & Lempitsky, V. S. (2011). Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (pp. 2188–2202).

Gong, S., Loy, C. C., & Xiang, T. (2011). Security and surveillance. Visual Analysis of Humans (pp. 455–472). Berlin: Springer.

Gong, Y. (2003). Summarizing audiovisual contents of a video program. EURASIP Journal on Advances in Signal Processing, 2003, 160–169.

Gygli, M., Grabner, H., & Van Gool, L. (2015). Video summarization by learning submodular mixtures of objectives. In IEEE conference on computer vision and pattern recognition (pp. 3090–3098).

Gygli, M., Grabner, H., Riemenschneider, H., & Van Gool, L. (2014). Creating summaries from user videos. In European conference on computer vision (pp. 505–520).

Heer, J., & Chi, E. H. (2001). Identification of web user traffic composition using multi-modal clustering and information scent. In Proceedings of the workshop on web mining, SIAM conference on data mining (pp. 51–58).

Hospedales, T. M., Li, J., Gong, S., & Xiang, T. (2011). Identifying rare and subtle behaviors: a weakly supervised joint topic model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 2451–2464.

Huang, H. C., Chuang, Y. Y., & Chen, C. S. (2012). Affinity aggregation for spectral clustering. In IEEE conference on computer vision and pattern recognition.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.

Kang, H., Chen, X., Matsushita, Y., & Tang, X. (2006). Space-time video montage. In IEEE conference on computer vision and pattern recognition.

Karydis, I., Nanopoulos, A., Gabriel, H. H., & Spiliopoulou, M. (2009). Tag-aware spectral clustering of music items. In The international society for music information retrieval (pp. 159–164).

Khalidov, V., Forbes, F., & Horaud, R. (2011). Conjugate mixture models for clustering multimodal data. Neural Computation, 23, 517–557.

Khosla, A., Hamid, R., Lin, C. J., & Sundaresan, N. (2013). Large-scale video summarization using web-image priors. In IEEE conference on computer vision and pattern recognition (pp. 2698–2705).

Kim, C., & Hwang, J. N. (2002). Object-based video abstraction for video surveillance systems. IEEE Transactions on Circuits and Systems for Video Technology, 12, 1128–1138.

Kim, G., Sigal, L., & Xing, E. P. (2014). Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In IEEE conference on computer vision and pattern recognition (pp. 4225–4232).

Kratz, L., & Nishino, K. (2012). Going with the flow: pedestrian efficiency in crowded scenes. In European conference on computer vision.

Lee, Y. J., Ghosh, J., & Grauman, K. (2012). Discovering important people and objects for egocentric video summarization. In IEEE conference on computer vision and pattern recognition.

Li, W., Mahadevan, V., & Vasconcelos, N. (2013). Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 18–32.

Liu, B., Xia, Y., & Yu, P. S. (2000). Clustering through decision tree construction. In Conference on information and knowledge management.

Loy, C. C., Xiang, T., & Gong, S. (2012). Incremental activity modeling in multiple disjoint cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 1799–1813.

Lu, Z., & Grauman, K. (2013a). Story-driven summarization for egocentric video. In IEEE conference on computer vision and pattern recognition.

Lu, Z., & Grauman, K. (2013b). Story-driven summarization for egocentric video. In IEEE conference on computer vision and pattern recognition (pp. 2714–2721).

Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11, 19–60.

Martin, J. K. (1997). An exact probability metric for decision tree splitting and stopping. Machine Learning, 28, 257–291.

Money, A. G., & Agius, H. (2008). Video summarisation: A conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation, 19, 121–143.

Moosmann, F., Nowak, E., & Jurie, F. (2008). Randomized clustering forests for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1632–1646.

Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 971–987.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42, 145–175.

Perbet, F., Stenger, B., & Maki, A. (2009). Random forest clustering and application to video segmentation. In British machine vision conference.

Potapov, D., Douze, M., Harchaoui, Z., & Schmid, C. (2014). Category-specific video summarization. In European conference on computer vision (pp. 540–555).

Pritch, Y., Rav-Acha, A., Gutman, A., & Peleg, S. (2007). Webcam synopsis: Peeking around the world. In The IEEE international conference on computer vision.

Pritch, Y., Rav-Acha, A., & Peleg, S. (2008). Nonchronological video synopsis and indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1971–1984.

Schulter, S., Leistner, C., Wohlhart, P., Roth, P. M., & Bischof, H. (2013a). Alternating regression forests for object detection and pose estimation. In IEEE international conference on computer vision.

Schulter, S., Wohlhart, P., Leistner, C., Saffari, A., Roth, P. M., & Bischof, H. (2013b). Alternating decision forests. In IEEE conference on computer vision and pattern recognition.

Shi, T., & Horvath, S. (2006). Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15, 118–138.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In IEEE conference on computer vision and pattern recognition.

Strehl, A., & Ghosh, J. (2003). Cluster ensembles—A knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3, 583–617.

Sun, M., Farhadi, A., & Seitz, S. (2014). Ranking domain-specific highlights by analyzing edited videos. In European conference on computer vision (pp. 787–802).

Taskiran, C., Pizlo, Z., Amir, A., Ponceleon, D., & Delp, E. (2006). Automated video program summarization using speech transcripts. IEEE Transactions on Multimedia, 8, 775–791.

Toderici, G., Aradhye, H., Pasca, M., Sbaiz, L., & Yagnik, J. (2010). Finding meaning on youtube: Tag recommendation and category discovery. In IEEE Conference on Computer Vision and Pattern Recognition.

Topchy, A., Jain, A. K., & Punch, W. (2005). Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1866–1881.

Truong, B. T., & Venkatesh, S. (2007). Video abstraction: A systematic review and classification. ACM transactions on multimedia computing, communications, and applications.

Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al. (2001). Constrained k-means clustering with background knowledge. International Conference on Machine learning, 1, 577–584.

Wang, M., Hong, R., Li, G., Zha, Z., Yan, Z. J., Yan, S., et al. (2012). Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia, 14, 975–985.

Wang, X., Ma, X., & Grimson, W. E. L. (2009). Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 539–555.

Wang, Z., Zhao, M., Song, Y., Kumar, S., & Li, B. (2010). Youtubecat: Learning to categorize wild web videos. In IEEE conference on computer vision and pattern recognition.

Wolf, W. (1996). Keyframe selection by motion analysis. In IEEE international conference on acoustics, speech, and signal processing.

Wu, S., Moore, B. E., & Shah, M. (2010). Chaotic invariants of lagrangian particle trajectories for anomaly detection in crowded scenes. In IEEE conference on computer vision and pattern recognition (pp. 2054–2060).

Xing, E. P., Jordan, M. I., Russell, S., & Ng, A. Y. (2002). Distance metric learning with application to clustering with side-information. In Advances in neural information processing systems (pp. 505–512).

Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Advances in neural information processing systems.

Zhang, D. Q., Lin, C. Y., Chang, S. F., & Smith, J. R. (2004). Semantic video clustering across sources using bipartite spectral clustering. In IEEE international conference on multimedia and expo.

Zhang, H., Wu, J., Zhong, D., & Smoliar, S. W. (1997). An integrated system for content-based video retrieval and browsing. Pattern Recognition, 30, 643–658.

Zhao, B., & Xing, E. P. (2014). Quasi real-time summarization for consumer videos. In IEEE conference on computer vision and pattern recognition (pp. 2513–2520).

Zhao, Y., & Karypis, G. (2004). Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning (pp. 311–331).

Zhu, X., Loy, C. C., & Gong, S. (2014). Constructing robust affinity graphs for spectral clustering. In Proceedings of the 27th IEEE conference on computer vision and pattern recognition (pp. 1450–1457).

Title: Learning from Multiple Sources for Video Summarisation
Authors: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Publication date: 01.05.2016
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 3/2016
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-015-0864-3
