Elsevier

Neurocomputing

Volume 173, Part 3, 15 January 2016, Pages 1001-1016
Neurocomputing

Summarizing video sequence using a graph-based hierarchical approach

https://doi.org/10.1016/j.neucom.2015.08.057Get rights and content

Abstract

Video summarization is a simplification of video content for compacting the video information. The video summarization problem can be transformed into a clustering problem, in which some frames are selected to saliently represent the video content. In this work, we use a graph-based hierarchical clustering method for computing a video summary. In fact, the proposed approach, called HSUMM, adopts a hierarchical clustering method to generate a weight map from the frame similarity graph in which the clusters (or connected components of the graph) can easily be inferred. Moreover, the use of this strategy allows the application of a similarity measure between clusters during graph partition, instead of considering only the similarity between isolated frames. We also provide a unified framework for video summarization based on minimum spanning tree and weight maps in which HSUMM could be seen as an instance that uses a minimum spanning tree of frames and a weight map based on hierarchical observation scales computed over that tree. Furthermore, a new evaluation measure that assesses the diversity of opinions among users when they produce a summary for the same video, called Covering, is also proposed. During tests, different strategies for the identification of summary size and for the selection of keyframes were analyzed. Experimental results provide quantitative and qualitative comparison between the new approach and other popular algorithms from the literature, showing that the new algorithm is robust. Concerning quality measures, HSUMM outperforms the compared methods regardless of the visual feature used in terms of F-measure.

Introduction

The increasing number of video files has made the task of searching for a specific content very expensive, because it is necessary to index the video information. Usually, there are two approaches to cope with the index problem: (i) manual annotation; or (ii) automatic annotation. The former is expensive and subjective, since it depends on the experts to perform this annotation. The second one is objective and is directly related to the visual contents, however it depends on the features which are used. The cost to find a specific content related to a video depends on the index size, thus instead of considering all video content, one can summarize it in order to reduce the search space. In the literature, there are many approaches to simplify the video content [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Thus, video summarization is a simplification of video content in order to reduce the amount of data without losing information. The video summarization problem can be transformed into a clustering problem, in which some frames are selected to saliently represent the video content, as illustrated in Fig. 1.

In [5], the authors proposed an approach to cope with the video summarization problem in which the clustering is achieved by a K-means algorithm, but it is necessary to know a priori the number of clusters. In [8], it was proposed the use of graph-theoretic FCM algorithm for video summarization, however the graph creation is directly related to number of centers. In [6], the authors used a Delaunay triangulation to automatically identify the frame clusters, however this approach is expensive and produces very compressed summaries. VISTO [7] is based on the extraction of low-level color features of video frames and on a modification of Furthest Point-First algorithm to cluster the frames. This approach is fast but the summaries are big. In [12], a normalized cut algorithm is employed to globally and optimally partitionate a video into clusters. The authors in [14] have presented a keyframe-based interface for browsing video (which is represented by a keyframe hierarchy) on different devices. Their algorithm uses a non-temporal clustering method for locating a single keyframe and two new temporal clustering methods (backtrack-balanced and split-balanced) when a video clip is searched. Moreover, it is able to adapt both to the user׳s task and to the repetitiveness of the video. Recently, some new approaches for video summarization using constraint satisfaction programming [10] and prior information [11] were proposed. In [13], it was proposed a quasi real-time video summarization based on learning of a dictionary in which a video summary is then generated by combining segments that cannot be sparsely reconstructed using the learned dictionary. In [15], the authors have proposed a Bag-of-Importance model in order to effectively quantify the importance of individual local features in a transformed sparse space in which both the inter-frame and intra-frame properties of features are exploited and quantified for video summarization. In [16], the authors have proposed a multi-task deep visual-semantic embedding model to directly compute the similarity between the query and video thumbnails by mapping them into a common latent semantic space.

In [9], the authors use a graph-theoretic divisive clustering algorithm based on construction of a Minimum Spanning Tree (MST) to select video frames without segmenting the video into shots or scenes, eliminating preprocessing steps. It is important to note that according to [17] the MST approach for clustering is hierarchical, and thanks to this property, it is easy to compute a video summary regarding a specified number of keyframes. Following the same strategy, in this work, video summarization problem is also transformed into a graph-based clustering problem in which a cluster (or connected component of the graph), computed from a graph partition, represents a set of similar frames. Similar to [9], the proposed method – HSUMM – eliminates the preprocessing steps using a MST. But instead of computing the graph partition directly over the MST, the proposed approach adopts a hierarchical clustering method to generate a weight map from the MST in which the clusters can easily be inferred. Moreover, the use of a graph-based hierarchical clustering method allows HSUMM to apply a similarity measure between clusters (i.e., set of frames) during graph partition, instead of considering only the similarity between isolated frames. Distinct strategies could be used in HSUMM to select keyframes for saliently represent the video content. In this work, two different procedures (considering or not the graph weights) are analyzed. Finally, a new evaluation measure, called Covering, is also proposed. This measure assesses the diversity of opinions among users when they produce a summary for the same video, what is not taken into account when average values of quality measures are used, such as in [5].

The use of MST and the use of observation scales computed on MST were first introduce in our previous works [9] and [18], respectively. Moreover, in [18] – hereafter called HSUMM1 – only the unweighted center selection procedure was evaluated. The main differences of the this work in comparison with the previous ones are highlighted in the following:

  • An automatic approach for determining the number of frame clusters. Unlike our previous work [9] – hereafter called AGM1/AGM2 – in which the number of frame clusters is a parameter, an automatic approach is proposed for determining the number of frame clusters based on the homogeneity of each cluster.

  • A new keyframe selection based on a weighted graph. This strategy takes into account the dissimilarity values which are completely ignored in HSUMM1 [18].

  • New formulations. In order to clarify some explanations and to create a general framework, we revisit some definitions and formulations presented in HSUMM1 [18].

  • Generalization of graph-based hierarchical video summarization. In this work, we present a unified framework for video summarization based on MST and weight maps, which will facilitate the adoption and analysis of others strategies for building weight maps in the future.

The paper is organized as follows. Section 2 describes the clustering problem using minimum spanning tree and also defines many concepts used in our work. Section 3 describes our methodology to solve the video summarization problem, while Section 4 presents a new evaluation measure that can be used to compare video summarization methods. Section 5 describes the performed experiments together with a comparative analysis between our approach and the other methods. Finally, we give some conclusions in Section 6.

Section snippets

Some fundamental concepts

This section introduces main concepts used in this work. In order to facilitate the comprehension of our method, the main symbols used throughout the text are summarized in Table 1.

Video summarization

In this work, video summarization problem is transformed into a graph-based clustering problem in which a cluster (or connected component of the graph), computed from a graph partition, represents a set of similar frames. Fig. 3 illustrates the proposed method for video summarization, the so-called HSUMM. The method depends basically on (i) the visual feature selected for representing a frame; (ii) the dissimilarity measure used for comparing two frames; (iii) the computation of frame clusters

A new evaluation measure

The process of evaluating video summaries is not a simple task, since there is no well-known groundtruth and the quality of a summary is highly subjective. In this way, some initiatives have been proposed, like Comparison of User Summaries (CUS) [5], in which the video summary is built manually by a number of users from the sampled frames, in order to assess and compare automatic generated video summaries. In [7], the summaries are compared to Open Video2 summaries

Experiments

In order to compare HSUMM with other methods, the same dataset adopted in [7] composed of 50 videos in different genres (documentary, lectures, ephemeral, historical, educational) is used. With respect to the assessment, we consider the same measures proposed by [5] in terms of Comparison of User Summaries (CUS), in which the video summary is built manually by a number of users from sampled frames. In fact, the summaries are evaluated by using CUSa and CUSe calculated from 250 summaries of

Conclusion and further works

Video summarization is a simplification of video content for compacting the video information in which a small number of frames must be selected for saliently representing its content. To address video summarization, we proposed the use of a graph-based hierarchical clustering method, called HSUMM, which allows the application of a similarity measure between clusters during graph partition, instead of considering only the similarity between isolated frames. Moreover, we provide a unified

Acknowledgments

The authors are grateful to PUC Minas (Pontifícia Universidade Católica de Minas Gerais), MIC BH (Microsoft Innovation Center Belo Horizonte), CNPq, Capes and FAPEMIG for the partial financial support of this work. The authors also thank to the anonymous reviewers for their valuable comments and suggestions.

Luciana dos Santos Belo received her B.Sc. and M.Sc. degrees in Computer Science, from Pontifícia Universidade Católica de Minas Gerais (PUC Minas), in 2010 and 2014, respectively. She developed projects in Audio-Visual Information Processing Laboratory. She is currently a QA Analyst.

References (26)

  • S.E.F. de Avila et al.

    Vsumma mechanism designed to produce static video summaries and a novel evaluation method

    Pattern Recognit. Lett.

    (2011)
  • S. Avila et al.

    Pooling in image representationthe visual codeword point of view

    Comput. Vis. Image Underst.

    (2013)
  • Y.-P. Tan et al.

    Video scene clustering by graph partitioning

    Electron. Lett.

    (2003)
  • M. Yeung et al.

    Video visualization for compact presentation and fast browsing of pictorial content

    IEEE Trans. Circuits Syst. Video Technol.

    (1997)
  • S.-H. Zhu et al.

    Automatic video partition for high-level search

    Int. J. Comput. Sci. Eng. Syst.

    (2008)
  • Y. Rui, T. Huang, S. Mehrotra, Browsing and retrieving video content in a unified framework, in: IEEE Second Workshop...
  • P. Mundur et al.

    Keyframe-based video summarization using Delaunay clustering

    Int. J. Digit. Libr.

    (2006)
  • M. Furini, F. Geraci, M. Montangero, M. Pellegrini, Visto: visual storyboard for web video browsing, in: Proceedings of...
  • D. Besiris, F. Fotopoulou, G. Economou, S. Fotopoulos, Video summarization by a graph-theoretic fcm based algorithm,...
  • S.J.F. Guimarães, W. Gomes, A static video summarization method based on hierarchical clustering, in: 15th...
  • S.-A. Berrani, H. Boukadida, P. Gros, Constraint satisfaction programming for video summarization, in: 2013 IEEE...
  • A. Khosla, R. Hamid, C.-J. Lin, N. Sundaresan, Large-scale video summarization using web-image priors, in: 2013 IEEE...
  • C. Ngo, Y. Ma, H. Zhang, Automatic video summarization by graph modeling, in: 9th IEEE International Conference on...
  • Cited by (32)

    • First person video summarization using different graph representations

      2021, Pattern Recognition Letters
      Citation Excerpt :

      More recently, Ding et al. [20] proposed long video segmentation based on the spatio-temporal interest points with dynamic clustering, and traditional LSTM model is introduced to generate the video captions. Finally, the authors in [21] adopted a graph based hierarchical clustering approach for video summarization. In this paper, we present a solution to the egocentric video summarization problem using various graph representations.

    • A review of video surveillance systems

      2021, Journal of Visual Communication and Image Representation
      Citation Excerpt :

      Each technique summarizes the video using a specific feature, such as trajectories, moving objects, abnormal detection, and many others. These categories of techniques can be classified into two general categories, scene-based (i.e., static [164], dynamic [165] and content-based approaches, and the content- based approaches can be further decomposed into three types related to the content of the video including motion-based [166], action-based [167] and event-based [168]. Video summarization is a short version of the longer video sequence.

    • Learning to realign hierarchy for image segmentation

      2020, Pattern Recognition Letters
      Citation Excerpt :

      A hierarchical method seems to be an interesting approach to deal with that issue since it incorporates information from multiple scales. Moreover, the development of effective and efficient hierarchical segmentation methods may help to improve the results for several problems, because segmentation could be used as a basic tool in many computer vision tasks [4,21,22]. In the image domain, hierarchical image segmentation can be defined as a set of image segmentations at different detail levels in which the segmentations at coarser detail levels can be produced from simple merges of regions from segmentations at finer detail levels [14].

    • Efficient CNN based summarization of surveillance videos for resource-constrained devices

      2020, Pattern Recognition Letters
      Citation Excerpt :

      For instance, Kuanar et al. [36] presented a VS approach with intelligent utilization of bipartite graph matching for modelling inter-view dependencies in multi-view videos and optimum-path forest scheme for clustering. In continuation with this method, dos Santos Belo et al. [37] used graph-assisted hierarchical method for VS. In an attempt to extend the usefulness of VS, Bagheri et al. [38] proposed a method for temporal mapping of surveillance videos for indexing and searching.

    • Social media based event summarization by user–text–image co-clustering

      2019, Knowledge-Based Systems
      Citation Excerpt :

      For example, Ozdikis et al. [20] applied an evidential reasoning technique in order to estimate the geographical locations of events in Twitter. In recent years, there emerges many social media based applications, such as place-of-interest mining based on location information [32], personalized travel recommendation based on user interest and travel history [33], location estimation [21,22] and brand recommendation [34], video content summarization [35–38]. Many works have focused on event summarization based on the large amount of social media information: texts, images, user factors, location information.

    • Graph coloring based surveillance video synopsis

      2017, Neurocomputing
      Citation Excerpt :

      Fortunately, video condensation can effectively reduce memory storage for large data sets. Various condensation methods have been proposed to generate meaningful representations of original videos, such as video fast forward [1–3], video skimming [4], video abstraction [5–8], video summarization [9–14], video montage [15–17], and video synopsis [18–21]. These methods can be generally classified into two broad categories: frame-based approaches and object-based approaches.

    View all citing articles on Scopus

    Luciana dos Santos Belo received her B.Sc. and M.Sc. degrees in Computer Science, from Pontifícia Universidade Católica de Minas Gerais (PUC Minas), in 2010 and 2014, respectively. She developed projects in Audio-Visual Information Processing Laboratory. She is currently a QA Analyst.

    Carlos Antônio Caetano Jr. is a PhD student in Computer Science at Universidade Federal de Minas Gerais (UFMG). Received his B.Sc. M.Sc. degrees in Information Systems and Computer Science, from Pontifícia Universidade Católica de Minas Gerais (PUC Minas) and Universidade Federal de Minas Gerais (UFMG), in 2011 and 2014, respectively. He is currently at UFMG as a Researcher at Smart Surveillance Interest Group (SSIG/UFMG). His research interests include computer vision, smart surveillance and machine learning applications, with focus on visual pattern recognition.

    Zenilton Kleber Gonçalves do Patrocínio Jr. received B.Sc. in Computer Science from Universidade Federal de Minas Gerais, Brazil, and B.A. in Business Management from Fundação Mineira de Educação e Cultura, Brazil, both in 1989. In 1993 and 2005, respectively, he received his M.Sc. and Ph.D. degrees in computer science from Universidade Federal de Minas Gerais, Brazil. Zenilton is a member of the Computer Science Department (DCC) of Pontifícia Universidade Católica de Minas Gerais (PUC Minas), Brazil, since 2004. His research interests include graph-based methods in pattern recognition, optimization algorithms, machine learning, and their applications to digital image and video processing, computer vision and, recently, to multimodal content based information retrieval.

    Silvio Jamil Ferzoli Guimarães received his B.Sc. M.Sc. and D.Sc. degrees in Computer Science, from Universidade Federal de Viçosa (UFV), Universidade Estadual de Campinas (UNICAMP) and Universidade Federal de Minas Gerais (UFMG), in 1997, 1999 and 2003, respectively. Thanks to co-supervision program between UFMG and Université Marne-la-Vallée, he received another D.Sc. degree in Informatique. Silvio is a member of the Computer Science Department (DCC) staff, Pontifícia Universidade Católica de Minas Gerais (PUC Minas), Belo Horizonte-MG, Brazil (since 2002). He is the founder and header of the Audio-Visual Information Processing Laboratory. His research interests include digital image and video processing and computer vision applications, hierarchical information analysis, multimedia information systems, and content based information (image and video) retrieval.

    View full text