Elsevier

Neurocomputing

Volume 145, 5 December 2014, Pages 140-153

Spatial and temporal visual attention prediction in videos using eye movement data

https://doi.org/10.1016/j.neucom.2014.05.049

Abstract

Visual attention detection in static images has achieved outstanding progress in recent years, whereas much less effort has been devoted to learning visual attention in video sequences. In this paper, we propose a novel method to model spatial and temporal visual attention for videos by learning from human gaze data. Spatial visual attention predicts where viewers look in each video frame, while temporal visual attention measures which video frames are more likely to attract viewers' interest. Our underlying premise is that objects and their movements, rather than conventional contrast-related information, are the major factors that drive visual attention in dynamic scenes. First, the proposed models extract two types of bottom-up features derived from multi-scale object filter responses and spatiotemporal motion energy, respectively. Then, spatiotemporal gaze density and inter-observer gaze congruency are generated from a large collection of human eye-gaze data to form two training sets. Finally, prediction models of spatial and temporal visual attention are learned from those two training sets and the bottom-up features. Extensive evaluations on publicly available video benchmarks and an application to interestingness prediction of movie trailers demonstrate the effectiveness of the proposed work.

Introduction

The human visual attention mechanism plays an important role in selecting the most informative content from massive visual input for further processing. In the engineering field, computational visual attention models have been developed to mimic this mechanism: by analyzing visual features, they automatically estimate which locations in a scene are more likely to attract viewers' attention. These models can serve as a foundation for many multimedia applications such as image/video retrieval [1], image segmentation [2], [37], [41], video summarization [3], video retargeting [4], and video surveillance [5].

In recent years, most efforts have been devoted to computational models for static images, whereas limited work has explored the visual attention of users watching dynamic videos. Because videos are fundamentally different from static images, computational models developed for static images cannot be applied to dynamic videos directly. For video understanding, two aspects of visual attention need to be addressed: spatial visual attention (SVA) and temporal visual attention (TVA). SVA computation predicts where viewers look within each video frame, whereas TVA studies when (at which frames) viewers are more likely to be attracted by the video content. SVA has received relatively more attention in the literature; in contrast, as far as we know, very limited work has been done to model TVA systematically.

Motivated by the finding that visual attention and eye movement are highly correlated [6], a recent promising idea is to employ eye trackers to investigate this relation and further build a mapping between eye fixations and visual features. In addition, recent research [8] suggests that the allocation of attention during image/video viewing is primarily driven by higher-level cognitive factors, e.g., meaningful objects. This differs from most traditional methods, in which only global or local contrast information is considered. Although the concept in [8] is appealing, fully automatic computational models of visual attention that explicitly model "meaningful" objects are still lacking.

In this paper, we attempt to develop a unified framework to model both SVA and TVA for videos by learning from human gaze data. Our underlying assumption is that objects and their movements, rather than traditional contrast-related information, are the major factors that drive human visual attention in dynamic scenes. The proposed framework consists of three major components. First, we extract two types of bottom-up features derived from multi-scale object filter responses and oriented spatiotemporal motion energy, respectively. Second, spatiotemporal gaze density and inter-observer gaze congruency are generated from a large collection of human eye-gaze data, forming two training sets. Finally, prediction models of temporal visual attention and spatial visual attention are learned from those two training sets and the bottom-up features, respectively.
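To make the motion-related feature concrete, the following is a minimal sketch of oriented spatiotemporal motion-energy features computed with quadrature pairs of 3D Gabor-like filters. It is an illustrative approximation rather than the authors' exact implementation; the filter size, orientations, and frequencies below are assumptions.

# A minimal sketch (not the authors' exact implementation) of oriented
# spatiotemporal motion-energy features, assuming a grayscale video
# volume of shape (T, H, W) and Gabor-like quadrature filter pairs.
import numpy as np
from scipy.ndimage import convolve

def gabor_3d(size, theta, temporal_freq, sigma=2.0, spatial_freq=0.25):
    # Quadrature pair of 3D (t, y, x) Gabor filters tuned to spatial
    # orientation `theta` and temporal frequency `temporal_freq`.
    r = np.arange(size) - size // 2
    t, y, x = np.meshgrid(r, r, r, indexing="ij")
    xr = x * np.cos(theta) + y * np.sin(theta)  # rotate spatial axes
    gauss = np.exp(-(x**2 + y**2 + t**2) / (2.0 * sigma**2))
    phase = 2.0 * np.pi * (spatial_freq * xr + temporal_freq * t)
    return gauss * np.cos(phase), gauss * np.sin(phase)

def motion_energy(video, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # Summing the squared responses of the quadrature pair yields a
    # phase-invariant energy map per orientation.
    maps = []
    for theta in orientations:
        even, odd = gabor_3d(size=7, theta=theta, temporal_freq=0.25)
        e = convolve(video, even, mode="nearest") ** 2 \
            + convolve(video, odd, mode="nearest") ** 2
        maps.append(e)
    return np.stack(maps, axis=0)  # (n_orientations, T, H, W)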

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes how to model SVA, and Section 4 presents the computation of TVA. Experimental results are reported in Section 5. One possible application of TVA is described in Section 6. Finally, conclusions are drawn in Section 7.

Section snippets

Related works

The computation of SVA has been extensively examined, especially for static images. Itti et al. [10] proposed the first and most influential model that linearly combines multiple attributes including intensity, color, and orientation contrast into a saliency map. The obtained saliency map topographically encodes stimulus conspicuity and thus can be used to indicate the salient regions in an image. Instead of using a predefined criterion to combine various features, another representative work
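As a reference point for the contrast-based baselines discussed above, the sketch below shows only the combination step of an Itti-style model [10]: per-feature conspicuity maps are normalized and linearly combined into a single saliency map. The equal-weight average and helper names are illustrative assumptions; computation of the feature maps themselves is omitted.

# Illustrative sketch of the Itti-style combination step [10]: normalize
# the intensity, color, and orientation conspicuity maps, then average
# them into one saliency map (equal weights assumed here).
import numpy as np

def normalize_map(m, eps=1e-8):
    # Rescale a conspicuity map to [0, 1] so features are comparable.
    return (m - m.min()) / (m.max() - m.min() + eps)

def combine_saliency(intensity_map, color_map, orientation_map):
    maps = [normalize_map(m) for m in (intensity_map, color_map, orientation_map)]
    return np.mean(maps, axis=0)  # linear combination with equal weights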

Computation of SVA

In this section, we propose a spatiotemporal saliency prediction model. Our basic assumption is that some interesting objects attract more of observers' attention. The SVA model essentially infers object interestingness, which is implemented by learning from eye-tracking data. In our model, each region is associated with a likelihood that it belongs to a certain object. Then, a learning algorithm is exploited to train a classifier to build a mapping between the
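The excerpt above describes learning a mapping from object-based region features to human fixation density. A minimal sketch of such a learned predictor follows; the choice of scikit-learn's SVR, the feature layout, and the region-mask painting step are assumptions rather than the paper's exact pipeline.

# A minimal sketch, not the authors' exact pipeline: regress per-region
# fixation density from object-based feature vectors, then paint the
# predicted scores back into a frame-sized saliency map.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_sva_model(region_features, gaze_density):
    # region_features: (n_regions, n_dims) bottom-up object features.
    # gaze_density: (n_regions,) spatiotemporal gaze-density targets.
    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
    model.fit(region_features, gaze_density)
    return model

def predict_saliency(model, region_features, region_masks, frame_shape):
    # Assign each region's predicted interestingness to its pixels.
    scores = model.predict(region_features)
    saliency = np.zeros(frame_shape, dtype=float)
    for mask, s in zip(region_masks, scores):
        saliency[mask] = s
    return saliency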

Computation of TVA

Eye gaze is a non-invasively recorded proxy for temporal visual attention. We define the TVA value from the perspective of variability between the gaze locations of observers viewing the same dynamic scene. Given a dynamic scene, when most viewers look at similar locations, it strongly suggests that there exist a few interesting objects in the scene and that the scene is therefore more interesting. On the contrary, if a scene contains nothing that stands out from the background or the fixation
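The excerpt above defines TVA through the congruency of gaze across observers watching the same frame. Below is a minimal dispersion-based sketch of per-frame inter-observer congruency; it is an illustrative stand-in for the paper's exact definition, and it assumes one (x, y) fixation per observer per frame.

# A minimal sketch of per-frame inter-observer gaze congruency, assuming
# one (x, y) fixation per observer per frame. Higher values indicate
# that observers agree more on where to look.
import numpy as np

def frame_congruency(fixations, frame_diag):
    # fixations: (n_observers, 2) array of gaze points for one frame.
    center = fixations.mean(axis=0)
    dispersion = np.linalg.norm(fixations - center, axis=1).mean()
    return 1.0 / (1.0 + dispersion / frame_diag)  # value in (0, 1]

def temporal_attention(gaze_per_frame, frame_shape):
    # gaze_per_frame: list of (n_observers, 2) arrays, one per frame.
    diag = np.hypot(*frame_shape)
    return np.array([frame_congruency(f, diag) for f in gaze_per_frame])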

Experimental results

In this section, experimental results on the eye-tracking video dataset are presented. We conduct two experiments to verify the effectiveness of the proposed SVA and TVA models. The first experiment tests the capability of our SVA model to predict human eye fixations in natural environments. The second experiment tests the performance of our TVA model for predicting the attractiveness of a video. All experiments are carried out on a Xeon X5660 2.8 GHz computer with 64 GB
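For readers unfamiliar with how fixation prediction is typically scored, the sketch below shows a common ROC-AUC protocol: saliency values at fixated pixels are compared against values at randomly sampled pixels. This is an illustrative recipe, not necessarily the exact evaluation used in the paper.

# A common evaluation recipe for fixation prediction (an illustrative
# sketch, not necessarily the paper's exact protocol): measure how well
# the saliency map separates fixated pixels from random non-fixated ones.
import numpy as np
from sklearn.metrics import roc_auc_score

def fixation_auc(saliency_map, fixation_points, n_negatives=1000, seed=0):
    rng = np.random.default_rng(seed)
    h, w = saliency_map.shape
    pos = np.array([saliency_map[y, x] for x, y in fixation_points])
    neg = saliency_map[rng.integers(0, h, n_negatives),
                       rng.integers(0, w, n_negatives)]
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    scores = np.concatenate([pos, neg])
    return roc_auc_score(labels, scores)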

One application of TVA: whether movie trailers are interesting

Movie trailers preview attractive plot elements of upcoming movies and arouse viewers' curiosity and interest. The interestingness of a movie trailer may largely influence the movie's box-office earnings. As far as we know, there is a lack of methods that enable quantitative estimation of the interestingness of a movie trailer. In this section, we apply our TVA model to this task.

Based on TVA for one frame obtained by the method described in Section 4, here we calculate TVA for the whole movie
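Since the excerpt above is truncated, the aggregation below is only one plausible way to turn per-frame TVA values into a trailer-level interestingness score; both the rule (averaging the top-scoring fraction of frames) and the parameter are assumptions.

# A hypothetical aggregation (the excerpt is truncated, so this is an
# assumption): summarize per-frame TVA values into one trailer-level
# interestingness score by averaging the most attention-grabbing frames.
import numpy as np

def trailer_interestingness(frame_tva, top_fraction=0.2):
    scores = np.sort(np.asarray(frame_tva))[::-1]  # descending TVA values
    k = max(1, int(len(scores) * top_fraction))
    return float(scores[:k].mean())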

Conclusion

In this paper, we have proposed a comprehensive and systematic framework for analyzing human visual attention in natural videos from both spatial and temporal perspectives. Our major contributions can be summarized as follows. First, we have proposed a novel SVA model to predict locations of interest in each frame from the video signal by learning the mapping from object-based features to human eye-fixation density. Experimental results have shown that this approach is more robust than

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under Grants 61103061 and 91120005, NPU-FFR-JC20120237, the Program for New Century Excellent Talents in University under Grant NCET-10-0079, and the Doctoral Fund of the Ministry of Education of the People's Republic of China under Grant 20136102110037.


References (43)

  • W. Einhäuser et al., Objects predict fixations better than early saliency, J. Vis. (2008)
  • H. Hadizadeh et al., Eye-tracking database for a set of standard video sequences, IEEE Trans. Image Process. (2012)
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. (1998)
  • T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: Proceedings of IEEE Conference...
  • L. Zhang, M. Tong, G. Cottrell, SUNDAy: saliency using natural statistics for dynamic analysis of scenes, in:...
  • S. Marat et al., Modelling spatiotemporal saliency to predict gaze direction for short videos, Int. J. Comput. Vis. (2009)
  • J. Han, Object segmentation from consumer videos: a unified framework based on visual attention, IEEE Trans. Consum. Electron. (2009)
  • Y. Zhai, M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in: Proceedings of ACM...
  • H.J. Seo et al., Static and space-time visual saliency detection by self-resemblance, J. Vis. (2009)
  • J. Harel et al., Graph-based visual saliency, Adv. Neural Inf. Process. Syst. (2007)
  • E. Rahtu, J. Kannala, M. Salo, J. Heikkilä, Segmenting salient objects from images and videos, in: Proceedings of...

Junwei Han received his Ph.D. degree from Northwestern Polytechnical University, Xi'an, China, in 2003. He is currently a professor with Northwestern Polytechnical University. His research interests include computer vision and multimedia processing.

Liye Sun received her B.S. degree from Northwestern Polytechnical University, China, in 2011. She is currently pursuing the Ph.D. degree at Northwestern Polytechnical University. Her research interests include computer vision and multimedia processing.

Xintao Hu received his M.S. and Ph.D. degrees from Northwestern Polytechnical University, China, in 2005 and 2011, respectively. He is currently a postdoctoral researcher with the School of Automation at NWPU. His research interests include computational brain imaging and its application in computer vision.

Jungong Han received his Ph.D. degree in Telecommunication and Information System from XiDian University, China, in 2004. From 2005 to 2010, he was with the Signal Processing Systems group at the Technical University of Eindhoven (TU/e), The Netherlands. In December 2010, he joined the Multi-Agent and Adaptive Computation group at the Centre for Mathematics and Computer Science (CWI) in Amsterdam. In July 2012, he started a senior scientist position with Civolution technology in Eindhoven (a combined synergy of Philips Content Identification and Thomson STS). Dr. Han's research interests include multimedia content identification, multi-sensor data fusion, and computer vision. He has written and co-authored over 70 papers. He is an associate editor of Elsevier Neurocomputing.

    Ling Shao received the B.Eng. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, the M.Sc. degree in medical image analysis, and the Ph.D. (D.Phil.) degree in computer vision from the University of Oxford, Oxford, U.K.

    He is currently a Senior Lecturer (Associate Professor) with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, U.K., and is a Guest Professor with Nanjing University of Information Science and Technology, China. Before joining Sheffield University, he was a Senior Scientist with Philips Research, Eindhoven, The Netherlands. He has authored/co-authored over 100 journal and conference papers and holds over 10 patents. His current research interests include computer vision, pattern recognition, and video processing.

    Dr. Shao is an Associate Editor of IEEE Transactions on Cybernetics, Neurocomputing, the International Journal of Image and Graphics, and the EURASIP Journal on Advances in Signal Processing, and has edited several special issues for journals of IEEE, Elsevier and Springer. He has organized several workshops with top conferences, such as ICCV, ACM Multimedia and ECCV. He has been serving as a Program Committee member for many international conferences, including ICCV, CVPR, ECCV, ACM MM, BMVC, and so on. He is also a Fellow of the British Computer Society.
