Elsevier

Neurocomputing

Volume 145, 5 December 2014, Pages 140-153

Spatial and temporal visual attention prediction in videos using eye movement data

https://doi.org/10.1016/j.neucom.2014.05.049

Abstract

Visual attention detection in static images has achieved outstanding progress in recent years, whereas much less effort has been devoted to learning visual attention in video sequences. In this paper, we propose a novel method to model spatial and temporal visual attention for videos by learning from human gaze data. Spatial visual attention predicts where viewers look in each video frame, while temporal visual attention measures which video frames are more likely to attract viewers' interest. Our underlying premise is that objects and their movements, rather than conventional contrast-related information, are the major factors that drive visual attention in dynamic scenes. First, the proposed models extract two types of bottom-up features derived from multi-scale object filter responses and spatiotemporal motion energy, respectively. Then, spatiotemporal gaze density and inter-observer gaze congruency are generated from a large collection of human eye-gaze data to form two training sets. Finally, prediction models of spatial and temporal visual attention are learned from those two training sets and the bottom-up features. Extensive evaluations on publicly available video benchmarks and an application to interestingness prediction of movie trailers demonstrate the effectiveness of the proposed work.

Introduction

The human visual attention mechanism plays an important role in selecting the most informative content from massive visual input for further processing. In the engineering field, computational visual attention models have been developed to mimic this mechanism: by analyzing visual features, they automatically estimate which locations in a scene are more likely to attract viewers' attention. These models can serve as a foundation for many multimedia applications such as image/video retrieval [1], image segmentation [2], [37], [41], video summarization [3], video retargeting [4], and video surveillance [5].

In recent years, most efforts have been devoted to computational models for static images, whereas limited work has explored the visual attention of users watching dynamic videos. Because videos are fundamentally different from static images, computational models developed for static images cannot be applied to dynamic videos directly. For video understanding, two aspects of visual attention need to be addressed: spatial visual attention (SVA) and temporal visual attention (TVA). SVA computation predicts where viewers look within each video frame, whereas TVA studies when (at which frames) viewers are more likely to be attracted by the video content. SVA has received relatively more attention in the literature; in contrast, as far as we know, very limited work has been done to model TVA systematically.

Motivated by the finding that visual attention and eye movement are highly correlated [6], a recent promising idea is to employ eye trackers to investigate this relation and further build a mapping between eye fixations and visual features. In addition, recent research [8] suggests that the allocation of attention during image/video viewing is primarily driven by higher-level cognitive factors, e.g., meaningful objects. This differs from most traditional methods, in which only global or local contrast information is considered. Although the concept in [8] is appealing, fully automatic computational models of visual attention that explicitly model "meaningful" objects are still lacking.

In this paper, we attempt to develop a unified framework to model both SVA and TVA for videos by learning from human gaze data. Our underlying assumption is that objects and their movements, rather than traditional contrast-related information, are the major factors that drive human visual attention in dynamic scenes. The proposed framework consists of three major components. First, we extract two types of bottom-up features derived from multi-scale object filter responses and oriented spatiotemporal motion energy, respectively. Second, spatiotemporal gaze density and inter-observer gaze congruency are generated from a large collection of human eye-gaze data, forming two training sets. Finally, prediction models of temporal visual attention and spatial visual attention are learned from those two training sets and the bottom-up features, respectively.
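To make the motion-related feature concrete, the following is a minimal sketch of oriented spatiotemporal motion-energy features computed with quadrature pairs of 3D Gabor-like filters. It is an illustrative approximation rather than the authors' exact implementation; the filter size, orientations, and frequencies below are assumptions.

# A minimal sketch (not the authors' exact implementation) of oriented
# spatiotemporal motion-energy features, assuming a grayscale video
# volume of shape (T, H, W) and Gabor-like quadrature filter pairs.
import numpy as np
from scipy.ndimage import convolve

def gabor_3d(size, theta, temporal_freq, sigma=2.0, spatial_freq=0.25):
    # Quadrature pair of 3D (t, y, x) Gabor filters tuned to spatial
    # orientation `theta` and temporal frequency `temporal_freq`.
    r = np.arange(size) - size // 2
    t, y, x = np.meshgrid(r, r, r, indexing="ij")
    xr = x * np.cos(theta) + y * np.sin(theta)  # rotate spatial axes
    gauss = np.exp(-(x**2 + y**2 + t**2) / (2.0 * sigma**2))
    phase = 2.0 * np.pi * (spatial_freq * xr + temporal_freq * t)
    return gauss * np.cos(phase), gauss * np.sin(phase)

def motion_energy(video, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # Summing the squared responses of the quadrature pair yields a
    # phase-invariant energy map per orientation.
    maps = []
    for theta in orientations:
        even, odd = gabor_3d(size=7, theta=theta, temporal_freq=0.25)
        e = convolve(video, even, mode="nearest") ** 2 \
            + convolve(video, odd, mode="nearest") ** 2
        maps.append(e)
    return np.stack(maps, axis=0)  # (n_orientations, T, H, W)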

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes how to model SVA, and Section 4 presents the computation of TVA. Experimental results are reported in Section 5. One possible application of TVA is described in Section 6. Finally, conclusions are drawn in Section 7.

Section snippets

Related works

The computation of SVA has been extensively examined, especially for static images. Itti et al. [10] proposed the first and most influential model that linearly combines multiple attributes including intensity, color, and orientation contrast into a saliency map. The obtained saliency map topographically encodes stimulus conspicuity and thus can be used to indicate the salient regions in an image. Instead of using a predefined criterion to combine various features, another representative work
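As a reference point for the contrast-based baselines discussed above, the sketch below shows only the combination step of an Itti-style model [10]: per-feature conspicuity maps are normalized and linearly combined into a single saliency map. The equal-weight average and helper names are illustrative assumptions; computation of the feature maps themselves is omitted.

# Illustrative sketch of the Itti-style combination step [10]: normalize
# the intensity, color, and orientation conspicuity maps, then average
# them into one saliency map (equal weights assumed here).
import numpy as np

def normalize_map(m, eps=1e-8):
    # Rescale a conspicuity map to [0, 1] so features are comparable.
    return (m - m.min()) / (m.max() - m.min() + eps)

def combine_saliency(intensity_map, color_map, orientation_map):
    maps = [normalize_map(m) for m in (intensity_map, color_map, orientation_map)]
    return np.mean(maps, axis=0)  # linear combination with equal weights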

Computation of SVA

In this section, we propose a spatiotemporal saliency prediction model. Our basic assumption is that some interesting objects attract more of observers' attention. The SVA model essentially infers object interestingness, which is implemented by learning from eye-tracking data. In our model, each region is associated with a likelihood that it belongs to a certain object. Then, a learning algorithm is exploited to train a classifier to build a mapping between the
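The excerpt above describes learning a mapping from object-based region features to human fixation density. A minimal sketch of such a learned predictor follows; the choice of scikit-learn's SVR, the feature layout, and the region-mask painting step are assumptions rather than the paper's exact pipeline.

# A minimal sketch, not the authors' exact pipeline: regress per-region
# fixation density from object-based feature vectors, then paint the
# predicted scores back into a frame-sized saliency map.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_sva_model(region_features, gaze_density):
    # region_features: (n_regions, n_dims) bottom-up object features.
    # gaze_density: (n_regions,) spatiotemporal gaze-density targets.
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
    model.fit(region_features, gaze_density)
    return model

def predict_saliency(model, region_features, region_masks, frame_shape):
    # Assign each region's predicted interestingness to its pixels.
    scores = model.predict(region_features)
    saliency = np.zeros(frame_shape, dtype=float)
    for mask, s in zip(region_masks, scores):
        saliency[mask] = s
    return saliency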

Computation of TVA

Eye gaze is a non-invasively recorded proxy for temporal visual attention. We define the TVA value from the perspective of variability between the gaze locations of observers viewing the same dynamic scene. Given a dynamic scene, when most viewers look at similar locations, it strongly suggests that there exist a few interesting objects in the scene and that the scene is therefore more interesting. On the contrary, if a scene contains nothing that stands out from the background or the fixation
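The excerpt above defines TVA through the congruency of gaze across observers watching the same frame. Below is a minimal dispersion-based sketch of per-frame inter-observer congruency; it is an illustrative stand-in for the paper's exact definition, and it assumes one (x, y) fixation per observer per frame.

# A minimal sketch of per-frame inter-observer gaze congruency, assuming
# one (x, y) fixation per observer per frame. Higher values indicate
# that observers agree more on where to look.
import numpy as np

def frame_congruency(fixations, frame_diag):
    # fixations: (n_observers, 2) array of gaze points for one frame.
    center = fixations.mean(axis=0)
    dispersion = np.linalg.norm(fixations - center, axis=1).mean()
    return 1.0 / (1.0 + dispersion / frame_diag)  # value in (0, 1]

def temporal_attention(gaze_per_frame, frame_shape):
    # gaze_per_frame: list of (n_observers, 2) arrays, one per frame.
    diag = np.hypot(*frame_shape)
    return np.array([frame_congruency(f, diag) for f in gaze_per_frame])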

Experimental results

In this section, experimental results on the eye-tracking video dataset are presented. We conduct two experiments to verify the effectiveness of the proposed SVA and TVA models. The first experiment tests the capability of our SVA model to predict human eye fixations in natural environments. The second experiment tests the performance of our TVA model for predicting the attractiveness of a video. All experiments are carried out on a Xeon X5660 2.8 GHz computer with 64 GB
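For readers unfamiliar with how fixation prediction is typically scored, the sketch below shows a common ROC-AUC protocol: saliency values at fixated pixels are compared against values at randomly sampled pixels. This is an illustrative recipe, not necessarily the exact evaluation used in the paper.

# A common evaluation recipe for fixation prediction (an illustrative
# sketch, not necessarily the paper's exact protocol): measure how well
# the saliency map separates fixated pixels from random non-fixated ones.
import numpy as np
from sklearn.metrics import roc_auc_score

def fixation_auc(saliency_map, fixation_points, n_negatives=1000, seed=0):
    rng = np.random.default_rng(seed)
    h, w = saliency_map.shape
    pos = np.array([saliency_map[y, x] for x, y in fixation_points])
    neg = saliency_map[rng.integers(0, h, n_negatives),
                       rng.integers(0, w, n_negatives)]
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    scores = np.concatenate([pos, neg])
    return roc_auc_score(labels, scores)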

One application of TVA: whether movie trailers are interesting

Movie trailers preview attractive plot elements of upcoming movies and arouse viewers' curiosity and interest. The interestingness of a movie trailer may largely influence the movie's box-office earnings. As far as we know, there is a lack of methods that enable quantitative estimation of the interestingness of a movie trailer. In this section, we apply our TVA model to this task.

Based on TVA for one frame obtained by the method described in Section 4, here we calculate TVA for the whole movie
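Since the excerpt above is truncated, the aggregation below is only one plausible way to turn per-frame TVA values into a trailer-level interestingness score; both the rule (averaging the top-scoring fraction of frames) and the parameter are assumptions.

# A hypothetical aggregation (the excerpt is truncated, so this is an
# assumption): summarize per-frame TVA values into one trailer-level
# interestingness score by averaging the most attention-grabbing frames.
import numpy as np

def trailer_interestingness(frame_tva, top_fraction=0.2):
    scores = np.sort(np.asarray(frame_tva))[::-1]  # descending TVA values
    k = max(1, int(len(scores) * top_fraction))
    return float(scores[:k].mean())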

Conclusion

In this paper, we have proposed a comprehensive and systematic framework for analyzing human visual attention in natural videos from both spatial and temporal perspectives. Our major contributions can be summarized as follows. First, we have proposed a novel SVA model to predict locations of interest in each frame from the video signal by learning the mapping from object-based features to human eye-fixation density. Experimental results have shown that this approach is more robust than

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under Grants 61103061 and 91120005, NPU-FFR-JC20120237, the Program for New Century Excellent Talents in University under Grant NCET-10-0079, and the Doctoral Fund of the Ministry of Education of the People's Republic of China under Grant 20136102110037.


References (43)

  • W. Einhäuser et al., Objects predict fixations better than early saliency, J. Vis. (2008)
  • H. Hadizadeh et al., Eye-tracking database for a set of standard video sequences, IEEE Trans. Image Process. (2012)
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. (1998)
  • T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: Proceedings of IEEE Conference...
  • L. Zhang, M. Tong, G. Cottrell, SUNDAy: saliency using natural statistics for dynamic analysis of scenes, in:...
  • S. Marat et al., Modelling spatiotemporal saliency to predict gaze direction for short videos, Int. J. Comput. Vis. (2009)
  • J. Han, Object segmentation from consumer videos: a unified framework based on visual attention, IEEE Trans. Consum. Electron. (2009)
  • Y. Zhai, M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in: Proceedings of ACM...
  • H.J. Seo et al., Static and space-time visual saliency detection by self-resemblance, J. Vis. (2009)
  • J. Harel et al., Graph-based visual saliency, Adv. Neural Inf. Process. Syst. (2007)
  • E. Rahtu, J. Kannala, M. Salo, J. Heikkilä, Segmenting salient objects from images and videos, in: Proceedings of...

Junwei Han received his Ph.D. degree from Northwestern Polytechnical University, Xi'an, China, in 2003. He is currently a professor with Northwestern Polytechnical University. His research interests include computer vision and multimedia processing.

Liye Sun received her B.S. degree from Northwestern Polytechnical University, China, in 2011. She is currently pursuing the Ph.D. degree at Northwestern Polytechnical University. Her research interests include computer vision and multimedia processing.

Xintao Hu received his M.S. and Ph.D. degrees from Northwestern Polytechnical University, China, in 2005 and 2011, respectively. He is currently a postdoctoral researcher with the School of Automation at NWPU. His research interests include computational brain imaging and its application in computer vision.

Jungong Han received his Ph.D. degree in Telecommunication and Information System from XiDian University, China, in 2004. From 2005 to 2010, he was with the Signal Processing Systems group at the Technical University of Eindhoven (TU/e), The Netherlands. In December 2010, he joined the Multi-Agent and Adaptive Computation group at the Centre for Mathematics and Computer Science (CWI) in Amsterdam. In July 2012, he started a senior scientist position with Civolution technology in Eindhoven (a combined synergy of Philips Content Identification and Thomson STS). Dr. Han's research interests include multimedia content identification, multi-sensor data fusion, and computer vision. He has written and co-authored over 70 papers. He is an associate editor of Elsevier Neurocomputing.

    Ling Shao received the B.Eng. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, the M.Sc. degree in medical image analysis, and the Ph.D. (D.Phil.) degree in computer vision from the University of Oxford, Oxford, U.K.

    He is currently a Senior Lecturer (Associate Professor) with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., and is a Guest Professor with Nanjing University of Information Science and Technology, China. Before joining Sheffield University, he was a Senior Scientist with Philips Research, Eindhoven, The Netherlands. He has authored/co-authored over 100 journal and conference papers and holds over 10 patents. His current research interests include computer vision, pattern recognition, and video processing.

    Dr. Shao is an Associate Editor of IEEE Transactions on Cybernetics, Neurocomputing, the International Journal of Image and Graphics, and the EURASIP Journal on Advances in Signal Processing, and has edited several special issues for journals of IEEE, Elsevier and Springer. He has organized several workshops with top conferences, such as ICCV, ACM Multimedia and ECCV. He has been serving as a Program Committee member for many international conferences, including ICCV, CVPR, ECCV, ACM MM, BMVC, and so on. He is also a Fellow of the British Computer Society.
