
Neurocomputing

Volume 149, Part B, 3 February 2015, Pages 788-799

What do eyes reveal about the mind?: Algorithmic inference of search targets from fixations

https://doi.org/10.1016/j.neucom.2014.07.055

Highlights

  • Providing a unified theoretical framework for intent decoding using eye movements.

  • Proposing two new algorithms for search target inference from fixations.

  • Studying the impact of target complexity in search performance and target inference.

  • Sharing a large collection of code and data to promote future research in this area.

Abstract

We address the question of inferring the search target from fixation behavior in visual search. Such inference is possible since during search, our attention and gaze are guided toward visual features similar to those in the search target. We strive to answer two fundamental questions: what are the most powerful algorithmic principles for this task, and how does their performance depend on the amount of available eye movement data and the complexity of the target objects? In the first two experiments, we choose a random-dot search paradigm to eliminate contextual influences on search. We present an algorithm that correctly infers the target pattern up to 50 times as often as a previously employed method and promises sufficient power and robustness for interface control. Moreover, the current data suggest a principal limitation of target inference that is crucial for interface design: if the target pattern exceeds a certain spatial complexity level, only a subpattern tends to guide the observers' eye movements, which drastically impairs target inference. In the third experiment, we show that it is possible to predict search targets in natural scenes using pattern classifiers and classic computer vision features significantly above chance. The availability of compelling inferential algorithms could initiate a new generation of smart, gaze-controlled interfaces and wearable visual technologies that deduce from their users' eye movements the visual information for which they are looking. In a broader perspective, our study shows directions for efficient intent decoding from eye movements.

Introduction

Eye movements can reveal a wealth of information about the complex cognitive states of the mind. They carry information that is diagnostic of the task an observer is trying to perform [16], [85], [20], [24], [35], [2], [38], [6]. Yarbus, in his seminal work in 1967, reported that observers' fixation patterns during free viewing of a painting differed dramatically from those recorded when observers were asked different questions about it [85]. While the allocation of attention is often task-driven, it can also be guided by bottom-up and stimulus-driven cues [80], [42], [40], [64], [39], [4], [3], [10]. Normal vision employs both processes simultaneously to control overt and covert shifts of attention.

There is a rich collection of literature that discusses the role of oculomotor behavior in tasks as diverse as reading [69], [22], pattern copying [1], portrait painting [55], visual search [80], [84], [88], tea making [44], sandwich making [32], fencing [29], cricket [46], squash [21], billiards [23], juggling [43], activity recognition [15], [50], [65], [25], and game playing [9], [7], [11]. See [47] for a review of eye movements in natural vision tasks. Some general underlying principles of gaze guidance have been discovered. For example, it is known that eye movements follow the road tangent in driving [45], some saccades occur to avoid obstacles (predictive saccades in walking [54]), and eye movements are sensitive to the value of visual items [59]. Eye movements are also indicators of abstract thought processes, for instance in arithmetic and geometric problem solving [18], list sorting, and mental imagery [53]. These findings highlight the intricate links between the mind, the body's actions, and the world around us. This active aspect of vision and attention has been extensively investigated in the context of natural behavior. Please see [1], [33], [78], [74], [57], [48], [47], [39], [5] for reviews.

Some computational models have been proposed to quantify gaze behavior, though their generalizations across tasks remain limited. Examples of top-down models of gaze control include HMM-based models of fixation prediction in reading (E–Z reader model by Reichle et al. [70], Mr. Chips model by Legge et al. [49]), a model of minimizing local uncertainty in object classification [71], a reward maximization framework that coordinates basic visuomotor routines to perform a complex task using reinforcement learning [77], Bayesian models of gaze control (e.g., [86], [72], [8]), and pattern classification models [9], [7]. In addition, a myriad of bottom-up models exist for predicting where observers look when engaged in free viewing of pictures of natural scenes (see the review by Borji and Itti [5]).

Despite the enormous amount of past research on understanding the mechanisms of gaze control, less systematic effort has been made so far to predict intent from fixations. The majority of studies have qualitatively analyzed the difference between eye movement patterns of observers viewing natural scenes under different questions (e.g., [24], [6]). Some researchers, conducting quantitative analyses, have reported that it is possible to decode the task from eye movements, while others have argued against it. For example, Henderson et al. [35] recorded eye movements of 12 participants while they were engaged in four tasks over 196 scenes and 140 texts: scene search, scene memorization, reading, and pseudo reading. They showed that the viewing tasks were highly distinguishable based on eye movement features in a four-way classification (decoding accuracy above 80%). In contrast, Greene et al. [28] did an experiment in which they recorded eye movements of observers viewing scenes under four questions: memorize the picture, determine the decade in which the picture was taken, determine how well the people in the picture know each other, and determine the wealth of the people in the picture. They were able to decode image identity and observer identity from eye movements above chance level, but failed to predict the viewer's task (see Fig. 4 in Greene et al.'s paper). Borji and Itti [6] were later able to decode observers' task on these data as well as on the original question of Yarbus. Several successful attempts have been made in the past to learn about human cognition from eye movements, such as predicting search targets [68], [30], decoding stimulus category [31], [60], [12], predicting the relative magnitude of a randomly chosen number by a person [51], predicting events [65], [15], predicting an observer's category of clinical condition [81], and task decoding [85].

The current study addresses the challenging problem of intent decoding, namely predicting what target an observer is looking for from their eye movements. Some scientific findings show promising directions in this regard. For example, it is known that during visual search, our attention and eye movements are biased by visual information resembling the target (e.g., [56], [66], [84], [68]), causing the image statistics near our fixated positions to be systematically influenced by basic visual features of the target ([68], [66]). One study also found that which of two possible object categories is being sought can be inferred from search statistics [87]. However, existing approaches have not considered strategies beyond elementary search statistics [68]. Furthermore, current methods have not been tested for target decoding on natural scenes.

Our work focuses on designing powerful search target inference algorithms from eye movements recorded during visual search. Visual search is an important task as it is one of the main ingredients of complex daily life tasks. Two important application domains of such target prediction algorithms are interface design (e.g., smart webpages) and wearable visual technologies. If target inference becomes possible for a large set of candidate objects, a new generation of smart, gaze-controlled human–computer interfaces could become reality [36], [75]. Gaining information about an interface user's object of interest, even in its absence, would be invaluable for the interface to provide the most relevant feedback to its user. In a broader perspective, our study shows directions for efficient intent decoding from eye movements.

Section snippets

Visual search experiments

We conduct three experiments to explore the potential of algorithmically inferring the search target from a searcher's visited patterns. In the first two experiments, we choose a random-dot search paradigm to eliminate contextual influences on visual search (see Fig. 1 for example scenes). The proposed techniques could also be applied to the local feature vectors of any type of display.
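To make the random-dot setting concrete, the following minimal Python sketch generates a binary dot display and extracts the local pattern around a fixation; all sizes, names, and parameters here are illustrative assumptions rather than the exact experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_display(height=40, width=40):
    """Random binary dot display (1 = dot present, 0 = dot absent)."""
    return rng.integers(0, 2, size=(height, width))

def fixated_pattern(display, fixation, size=3):
    """The size x size local pattern centred on a fixation given as (row, col)."""
    r, c = fixation
    half = size // 2
    return display[r - half:r + half + 1, c - half:c + half + 1]

display = make_display()
target = rng.integers(0, 2, size=(3, 3))               # hypothetical 3x3 search target
pattern = fixated_pattern(display, fixation=(10, 20))
print(pattern.shape)                                   # (3, 3) -- directly comparable to the target
```

Each fixation thus yields a local feature vector that can be compared against candidate targets, which is the representation assumed by the inference sketches that follow.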

Search in natural scenes is different from looking for targets in random-dot patterns since several other…

Algorithms for inferring search targets

Our development and evaluation of several inferential algorithms on synthetic patterns resulted in the discovery of two particularly powerful mechanisms, whose combination outperformed all other methods (including a baseline method from [68]) over the first two experiments without modifying their parameters between experiments. We also checked the generality of our results using pattern classifiers (Support Vector Machines and Naive Bayes) over Experiment 3 and examined whether a top-down biasing…
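As a rough illustration of the classifier-based decoding used for Experiment 3, the sketch below trains the two classifier families mentioned above (a linear SVM and Naive Bayes) on per-trial feature vectors and reports cross-validated decoding accuracy. The synthetic features and the protocol are placeholders, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# X: one feature vector per search trial (e.g., pooled descriptors of fixated patches);
# y: index of the target the observer searched for on that trial (placeholder data here).
n_trials, n_features, n_targets = 200, 128, 5
X = rng.normal(size=(n_trials, n_features))
y = rng.integers(0, n_targets, size=n_trials)

for name, clf in [("linear SVM", SVC(kernel="linear")), ("Naive Bayes", GaussianNB())]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated decoding accuracy = {acc:.2f}")
# With real fixation-based features, accuracy above chance (1 / n_targets) indicates
# that fixations carry information about the search target.
```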

Experiments 1 and 2: target inference on synthetic patterns

We first analyze the degree to which eye movements convey information regarding the search target compared to randomly fixated locations. Let s(t_i, T) = 1 − h(t_i, T) represent the similarity between the fixated pattern t_i and the search target T, where h(·, ·) is the Hamming distance. In Fig. 3a, we show the similarity between the most frequent fixated patterns (excluding first and last fixations) and the target. Taking the cumulative mean of the similarity measure (Fig. 3a, right) shows that highly fixated…
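A worked sketch of this similarity measure and its cumulative mean is given below, assuming the Hamming distance is normalized by pattern length so that s(t_i, T) lies in [0, 1]; the exact scaling used in the paper may differ.

```python
import numpy as np

def similarity(fixated, target):
    """s(t_i, T) = 1 - normalized Hamming distance between two binary patterns."""
    fixated, target = np.ravel(fixated), np.ravel(target)
    return 1.0 - np.mean(fixated != target)

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=9)                        # hypothetical 3x3 target, flattened
fixated_patterns = [rng.integers(0, 2, size=9) for _ in range(20)]

s = np.array([similarity(p, target) for p in fixated_patterns])
cumulative_mean = np.cumsum(s) / np.arange(1, len(s) + 1)  # as plotted in Fig. 3a (right)
print(cumulative_mean[-1])
```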

Discussion

We reported algorithms for search target inference from eye movements in synthetic and natural scenes. The results provide insight into both the way humans search for complex patterns and the most promising approaches to extracting intent information from fixation data. With regard to human search processes, it seems that complex target patterns are not matched with local display areas as a whole but that search is guided by subpatterns (in alignment with findings from [68]). Only target…

Conclusion

The present data suggest that the mechanisms underlying the weighted pattern voting algorithm are robust enough for a useful target estimation in a variety of human–computer interfaces. Our target inference algorithms can be adapted to various display types, since image filters commonly used in computer vision and behavioral studies (e.g., [88]) can transform any display into a matrix of feature vectors. Moreover, the current data advocate that the future designers of smart, gaze-controlled…
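The sketch below illustrates one plausible reading of a weighted pattern voting scheme: every fixated pattern casts a vote for each candidate target, weighted by its similarity to that candidate, and the candidate with the largest total weight is reported. The weighting rule and the candidate set are assumptions chosen for illustration and not necessarily the algorithm used in the paper.

```python
import numpy as np

def similarity(a, b):
    """1 minus the normalized Hamming distance between two binary patterns."""
    return 1.0 - np.mean(np.ravel(a) != np.ravel(b))

def infer_target(fixated_patterns, candidate_targets):
    """Return the index of the candidate with the largest summed similarity vote."""
    votes = np.zeros(len(candidate_targets))
    for pattern in fixated_patterns:
        for j, candidate in enumerate(candidate_targets):
            votes[j] += similarity(pattern, candidate)
    return int(np.argmax(votes)), votes

rng = np.random.default_rng(1)
candidates = [rng.integers(0, 2, size=(3, 3)) for _ in range(4)]
# Simulate fixated patterns biased toward candidate 2, mimicking target-guided search.
fixations = [np.where(rng.random((3, 3)) < 0.8, candidates[2], 1 - candidates[2])
             for _ in range(15)]
best, votes = infer_target(fixations, candidates)
print(best, votes)   # usually reports candidate 2 as the inferred search target
```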

Acknowledgments

A.B. was supported by the National Science Foundation (Grant number CMMI-1235539), the Army Research Office (W911NF-11-1-0046 and W911NF-12-1-0433), and the U.S. Army (W81XWH-10-2-0076). M.P. was supported by Grant number R15EY017988 from the National Eye Institute. The authors would like to thank Melissa Le-Hoa Võ and Jeremy Wolfe for sharing their data with us. Thanks also to the reviewers for their valuable comments. Our code and data are publicly available at http://ilab.usc.edu/borji/Resources.html.


References (88)

  • D. Parkhurst et al., Modeling the role of salience in the allocation of overt visual attention, Vis. Res. (2002)
  • M. Pomplun, Saccadic selectivity in complex visual search displays, Vis. Res. (2006)
  • A.M. Treisman et al., A feature integration theory of attention, Cogn. Psychol. (1980)
  • P. Verghese, Visual search and attention: a signal detection theory approach, Neuron (2001)
  • M.L.-H. Võ et al., The interplay of episodic and semantic memory in guiding repeated search in scenes, Cognition (2013)
  • D. Ballard et al., Memory representations in natural tasks, J. Cognit. Neurosci. (1995)
  • T. Betz et al., Investigating task-dependent top-down effects on overt visual attention, J. Vis. (2010)
  • A. Borji, Boosting bottom-up and top-down visual features for saliency estimation, in: CVPR,...
  • A. Borji, L. Itti, Exploiting local and global patch rarities for saliency detection, in: CVPR,...
  • A. Borji et al., State-of-the-art in modeling visual attention, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) (2013)
  • A. Borji et al., Defending Yarbus: eye movements reveal observers' task, J. Vis. (2014)
  • A. Borji, D.N. Sihite, L. Itti, Computational modeling of top-down visual attention in interactive environments, in:...
  • A. Borji, D.N. Sihite, L. Itti, An object-based Bayesian framework for top-down visual attention, in: Proceedings of...
  • A. Borji, D.N. Sihite, L. Itti, Probabilistic learning of task-specific visual attention, in: IEEE Conference on...
  • A. Borji et al., Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study, IEEE Trans. Image Process. (2012)
  • A. Borji et al., What/where to look next? Modeling top-down visual attention in complex interactive environments, IEEE Trans. Syst. Man Cybern. Syst. (2014)
  • A. Borji, H.R. Tavakoli, D.N. Sihite, L. Itti, Analysis of scores, datasets, and models in visual saliency modeling,...
  • A. Borji et al., Optimal attentional modulation of a neural population, Front. Comput. Neurosci. (2014)
  • A. Bulling et al., Eye movement analysis for activity recognition using electrooculography, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • G.T. Buswell, How People Look at Pictures (1935)
  • C. Cortes et al., Support-vector networks, Mach. Learn. (1995)
  • R.C. Cagli et al., Visuomotor characterization of eye movements in a drawing task, Vis. Res. (2009)
  • M.S. Castelhano et al., Viewing task influences eye movement control during active scene perception, J. Vis. (2009)
  • K. Chajka et al., Predictive eye movements in squash, J. Vis. (2006)
  • J. Clark et al., Word ambiguity and the optimal viewing position in reading, Vis. Res. (1998)
  • S. Crespi et al., Spotting expertise in the eyes: billiards knowledge as revealed by gaze shifts in a dynamic visual prediction task, J. Vis. (2012)
  • M. DeAngelus et al., Top-down control of eye movements: Yarbus revisited, Vis. Cognit. (2009)
  • A. Doshi et al., On the roles of eye gaze and head dynamics in predicting driver's intent to change lanes, IEEE Trans. Intell. Transp. Syst. (2009)
  • K.A. Ehinger et al., Modelling search for people in 900 scenes: a combined source model of eye guidance, Vis. Cognit. (2009)
  • L. Elazary et al., Interesting objects are visually salient, J. Vis. (2008)
  • N. Hagemann et al., Visual perception in fencing: do the eye movements of fencers represent their information pickup?, Atten. Percept. Psychophys. (2010)
  • A. Haji-Abolhassani et al., A computational model for task inference in visual search, J. Vis. (2013)
  • J. Harel, C. Moran, A. Huth, W. Einhäuser, C. Koch, Decoding what people see from where they look: predicting visual...
  • M.M. Hayhoe et al., Visual memory and motor planning in a natural task, J. Vis. (2003)

    Ali Borji received the B.S. and M.S. degrees in computer engineering from the Petroleum University of Technology, Tehran, Iran, 2001 and Shiraz University, Shiraz, Iran, 2004, respectively. He received the Ph.D. degree in computational neurosciences from the Institute for Studies in Fundamental Sciences (IPM) in Tehran, 2009. He then spent a year at the University of Bonn as a postdoc. He has been a postdoctoral scholar at iLab, University of Southern California, Los Angeles, since March 2010. His research interests include computer vision, machine learning, and neurosciences with particular emphasis on visual attention, visual search, active learning, scene and object recognition, and biologically plausible vision models.

Andreas Lennartz worked as a visiting master's student at the Department of Computer Science, University of Massachusetts at Boston in 2006–2007. He is currently with the DWH system development team at Air Berlin PLC & Co. KG (aviation).

Marc Pomplun is an Associate Professor of Computer Science at the University of Massachusetts Boston, where he joined the faculty in 2003. He received his M.S. in Computer Science at Bielefeld University, Bielefeld, Germany, in 1994. He obtained his Ph.D. in Computer Science under Professor Dr. Helge Ritter at Bielefeld University, Bielefeld, Germany. He was a post-doctoral fellow in the Department of Psychology at the University of Toronto. His research interests include human vision, computer vision, and human–computer interaction.

1 Tel.: +1 617 287 6443.
