What do eyes reveal about the mind?: Algorithmic inference of search targets from fixations
Introduction
Eye movements can reveal a wealth of information about the complex cognitive states of the mind. They carry information that is diagnostic of the task an observer is trying to perform [16], [85], [20], [24], [35], [2], [38], [6]. Yarbus, in his seminal work in 1967, reported that observers' fixation patterns during free viewing of a painting differed dramatically from those recorded when observers were given specific questions about the painting [85]. While the allocation of attention is often task-driven, it can also be guided by bottom-up, stimulus-driven cues [80], [42], [40], [64], [39], [4], [3], [10]. Normal vision employs both processes simultaneously to control overt and covert shifts of attention.
There is a rich collection of literature that discusses the role of oculomotor behavior in tasks as diverse as reading [69], [22], pattern copying [1], portrait painting [55], visual search [80], [84], [88], tea making [44], sandwich making [32], fencing [29], cricket [46], squash [21], billiards [23], juggling [43], activity recognition [15], [50], [65], [25], and game playing [9], [7], [11]. See [47] for a review of eye movements in natural vision tasks. Some general underlying principles of gaze guidance have been discovered. For example, it is known that eye movements follow the road tangent in driving [45], some saccades occur to avoid obstacles (predictive saccades in walking [54]), and eye movements are sensitive to the value of visual items [59]. Eye movements are also indicators of abstract thought processes, for instance in arithmetic and geometric problem solving [18], list sorting, and mental imagery [53]. These findings highlight the intricate links between the mind, the body's actions, and the world around us. This active aspect of vision and attention has been extensively investigated in the context of natural behavior. Please see [1], [33], [78], [74], [57], [48], [47], [39], [5] for reviews.
Some computational models have been proposed to quantify gaze behavior, though their generalization across tasks remains limited. Examples of top-down models of gaze control include HMM models of fixation prediction in reading (the E–Z Reader model by Reichle et al. [70] and the Mr. Chips model by Legge et al. [49]), a model of minimizing local uncertainty in object classification [71], a reward-maximization framework that uses reinforcement learning to coordinate basic visuo-motor routines in performing a complex task [77], Bayesian models of gaze control (e.g., [86], [72], [8]), and pattern classification models [9], [7]. In addition, a myriad of bottom-up models exist for predicting where observers look during free viewing of pictures of natural scenes (see the review by Borji and Itti [5]).
Despite the enormous amount of past research on understanding the mechanisms of gaze control, less systematic effort has been made so far to predict intents from fixations. The majority of studies have qualitatively analyzed the differences between eye movement patterns of observers viewing natural scenes under different questions (e.g., [24], [6]). Among researchers conducting quantitative analyses, some have reported that it is possible to decode the task from eye movements, while others have argued against it. For example, Henderson et al. [35] recorded eye movements of 12 participants engaged in four tasks over 196 scenes and 140 texts: scene search, scene memorization, reading, and pseudo-reading. They showed that the viewing tasks were highly distinguishable from eye movement features in a four-way classification (decoding accuracy above 80%). In contrast, Greene et al. [28] conducted an experiment in which they recorded eye movements of observers viewing scenes under four questions: memorize the picture, determine the decade in which the picture was taken, determine how well the people in the picture know each other, and determine the wealth of the people in the picture. They were able to decode image and observer identity from eye movements above chance level, but failed to predict the viewer's task (see Fig. 4 in Greene et al.'s paper). Borji and Itti [6] were later able to decode observers' task on this data as well as on the original question of Yarbus. Several successful attempts have been made in the past to learn about human cognition, such as predicting search targets [68], [30], decoding stimulus category [31], [60], [12], predicting the relative magnitude of a randomly chosen number by a person [51], predicting events [65], [15], predicting an observer's category of clinical condition [81], and task decoding [85].
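Quantitative task decoding of the kind reported in these studies amounts to supervised classification over summary eye-movement features. The sketch below is only an illustration with hypothetical feature values (mean fixation duration, mean saccade amplitude, fixation count) and a simple nearest-centroid classifier standing in for the classifiers actually used in those studies:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one mean feature vector (centroid) per task label."""
    labels = sorted(set(y))
    mask = np.array(y)
    return {c: X[mask == c].mean(axis=0) for c in labels}

def nearest_centroid_predict(centroids, x):
    """Assign a trial to the task whose centroid is closest in feature space."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Hypothetical per-trial features:
# [mean fixation duration (ms), mean saccade amplitude (deg), number of fixations]
X = np.array([
    [180, 2.0, 40], [190, 2.2, 42],   # reading-like trials
    [250, 5.5, 18], [260, 5.0, 20],   # scene-search-like trials
])
y = ["reading", "reading", "search", "search"]

centroids = nearest_centroid_fit(X, y)
print(nearest_centroid_predict(centroids, np.array([185, 2.1, 41])))  # reading
```

Real decoding studies use richer features (e.g., fixation densities, saccade dynamics) and stronger classifiers, but the pipeline, per-trial features plus a trained classifier over task labels, has this shape.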
The current study addresses the challenging problem of intent decoding – predicting what target an observer is looking for from their eye movements. Some scientific findings point to promising directions in this regard. For example, it is known that during visual search, our attention and eye movements are biased by visual information resembling the target (e.g., [56], [66], [84], [68]), causing the image statistics near our fixated positions to be systematically influenced by basic visual features of the target [68], [66]. One study also found that the type of object sought, of two possible categories, can be inferred from search statistics [87]. However, existing approaches have not considered strategies beyond elementary search statistics [68]. Furthermore, current methods have not been tested for target decoding on natural scenes.
Our work focuses on designing powerful search target inference algorithms from eye movements recorded during visual search. Visual search is an important task as it is one of the main ingredients of complex daily life tasks. Two important application domains of such target prediction algorithms are interface design (e.g., smart webpages) and wearable visual technologies. If target inference becomes possible for a large set of candidate objects, a new generation of smart, gaze-controlled human–computer interfaces could become reality [36], [75]. Gaining information about an interface user's object of interest, even in its absence, would be invaluable for the interface to provide the most relevant feedback to its user. In a broader perspective, our study shows directions for efficient intent decoding from eye movements.
Section snippets
Visual search experiments
We conduct three experiments to explore the potential of algorithmically inferring the search target from a searcher's visited patterns. In the first two experiments, we choose a random-dot search paradigm to eliminate contextual influences on visual search (see Fig. 1 for example scenes). The proposed techniques could also be applied to the local feature vectors of any type of display.
Search in natural scenes is different from looking for targets in random-dot patterns since several other
Algorithms for inferring search targets
Our development and evaluation of several inferential algorithms on synthetic patterns resulted in the discovery of two particularly powerful mechanisms, whose combination outperformed all other methods (including a baseline method from [68]) over the first two experiments without modifying their parameters between experiments. We also checked the generality of our results using pattern classifiers (Support Vector Machines and Naive Bayes) over Experiment 3 and sought whether a top-down biasing
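As a rough illustration of the voting idea named above, the following toy sketch lets each fixated pattern vote for candidate targets in proportion to its Hamming similarity. This is a simplified stand-in, not the exact weighted pattern voting algorithm, whose weighting scheme is given in the full section:

```python
import numpy as np

def infer_target(fixated_patterns, candidate_targets):
    """Toy pattern-voting decoder: each fixated pattern casts a vote for every
    candidate target, weighted by similarity (1 - normalized Hamming distance);
    the candidate accumulating the most weight is the inferred search target."""
    votes = np.zeros(len(candidate_targets))
    for p in fixated_patterns:
        for k, T in enumerate(candidate_targets):
            votes[k] += 1.0 - np.mean(p != T)
    return int(np.argmax(votes)), votes

# Two candidate 3x3 binary targets, and fixated patterns that are
# near-copies of the first one (a single cell flipped in each).
t0 = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
t1 = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 1]])
f1 = t0.copy(); f1[0, 0] = 0
f2 = t0.copy(); f2[2, 2] = 1

best, votes = infer_target([f1, f2, t0], [t0, t1])
print(best)  # 0 (the first candidate wins the vote)
```

Because fixated regions tend to resemble the sought target, the true target accumulates the largest vote mass even when no single fixated pattern matches it exactly.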
Experiments 1 and 2: target inference on synthetic patterns
We first analyze the degree to which eye movements convey information regarding the search target compared to randomly fixated locations. Let s(t_i, T) represent the similarity between the fixated pattern t_i and the search target T, where similarity is defined via the Hamming distance d_H(t_i, T), i.e., the number of mismatching cells (a smaller distance means a higher similarity). In Fig. 3a, we show the similarity between the most frequently fixated patterns (excluding first and last fixations) and the target. Taking the cumulative mean of the similarity measure (Fig. 3a, right) shows that highly fixated
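The similarity analysis above can be reproduced on toy data. The sketch below uses illustrative 3x3 binary patterns (the actual pattern size and normalization are those of Experiments 1 and 2) to compute Hamming-distance similarity between fixated patterns and the target, and its cumulative mean:

```python
import numpy as np

def hamming_similarity(pattern, target):
    """Similarity = fraction of matching cells, i.e. 1 - normalized Hamming distance."""
    return 1.0 - np.mean(pattern != target)

# Toy 3x3 binary target and fixated patterns, ordered from most to
# least frequently fixated.
target = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
fixated = [
    np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]]),  # 1 mismatching cell
    np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0]]),  # 1 mismatching cell
    np.array([[0, 1, 0], [1, 0, 1], [0, 0, 1]]),  # all 9 cells mismatch
]

sims = np.array([hamming_similarity(p, target) for p in fixated])
cum_mean = np.cumsum(sims) / np.arange(1, len(sims) + 1)
print(sims)      # per-pattern similarity to the target
print(cum_mean)  # cumulative mean over fixation rank, as in the Fig. 3a analysis
```

In this toy case the frequently fixated patterns resemble the target closely, so the cumulative mean similarity stays well above what random patterns would yield.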
Discussion
We reported algorithms for search target inference from eye movements in synthetic and natural scenes. The results provide insight into both the way humans search for complex patterns and the most promising approaches to extracting intent information from fixation data. With regard to human search processes, it seems that complex target patterns are not matched with local display areas as a whole but that search is guided by subpatterns (in alignment with findings from [68]). Only target
Conclusion
The present data suggest that the mechanisms underlying the weighted pattern voting algorithm are robust enough for a useful target estimation in a variety of human–computer interfaces. Our target inference algorithms can be adapted to various display types, since image filters commonly used in computer vision and behavioral studies (e.g., [88]) can transform any display into a matrix of feature vectors. Moreover, the current data advocate that the future designers of smart, gaze-controlled
Acknowledgments
A.B. was supported by the National Science Foundation (Grant number CMMI-1235539), the Army Research Office (W911NF-11-1-0046 and W911NF-12-1-0433), and the U.S. Army (W81XWH-10-2-0076). M.P. was supported by Grant number R15EY017988 from the National Eye Institute. The authors would like to thank Melissa Le-Hoa Võ and Jeremy Wolfe for sharing their data with us. Thanks also to the reviewers for their valuable comments. Our code and data are publicly available at http://ilab.usc.edu/borji/Resources.html.
References (88)
- What stands out in a scene? A study of human explicit saliency judgment, Vis. Res. (2013)
- Visual attention: the past 25 years, Vis. Res. (2011)
- Reconsidering Yarbus: a failure to predict observers' task from eye movement patterns, Vis. Res. (2012)
- Eye movements in natural behavior, Trends Cogn. Sci. (2005)
- Human gaze control during real-world scene perception, Trends Cogn. Sci. (2003)
- Semantic guidance of eye movements in real-world scenes, Vis. Res. (2011)
- Eye movements and the control of actions in everyday life, Prog. Retin. Eye Res. (2006)
- In what ways do eye movements contribute to everyday activities?, Vis. Res. (2001)
- A comparison of selected simple supervised learning algorithms to predict driver intent based on gaze data, Neurocomputing (2013)
- Modeling the influence of task on attention, Vis. Res. (2005)
- Modeling the role of salience in the allocation of overt visual attention, Vis. Res.
- Saccadic selectivity in complex visual search displays, Vis. Res.
- A feature integration theory of attention, Cogn. Psychol.
- Visual search and attention: a signal detection theory approach, Neuron
- The interplay of episodic and semantic memory in guiding repeated search in scenes, Cognition
- Memory representations in natural tasks, J. Cognit. Neurosci.
- Investigating task-dependent top-down effects on overt visual attention, J. Vis.
- State-of-the-art in modeling visual attention, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI)
- Defending Yarbus: eye movements reveal observers' task, J. Vis.
- Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study, IEEE Trans. Image Process.
- What/Where to look next? Modeling top-down visual attention in complex interactive environments, IEEE Trans. Syst. Man Cybern. Syst.
- Optimal attentional modulation of a neural population, Front. Comput. Neurosci.
- Eye movement analysis for activity recognition using electrooculography, IEEE Trans. Pattern Anal. Mach. Intell.
- How People Look at Pictures
- Support-vector networks, Mach. Learn.
- Visuomotor characterization of eye movements in a drawing task, Vis. Res.
- Viewing task influences eye movement control during active scene perception, J. Vis.
- Predictive eye movements in squash, J. Vis.
- Word ambiguity and the optimal viewing position in reading, Vis. Res.
- Spotting expertise in the eyes: billiards knowledge as revealed by gaze shifts in a dynamic visual prediction task, J. Vis.
- Top-down control of eye movements: Yarbus revisited, Vis. Cognit.
- On the roles of eye gaze and head dynamics in predicting driver's intent to change lanes, IEEE Trans. Intell. Transp. Syst.
- Modelling search for people in 900 scenes: a combined source model of eye guidance, Vis. Cognit.
- Interesting objects are visually salient, J. Vis.
- Visual perception in fencing: do the eye movements of fencers represent their information pickup?, Atten. Percept. Psychophys.
- A computational model for task inference in visual search, J. Vis.
- Visual memory and motor planning in a natural task, J. Vis.
Ali Borji received the B.S. and M.S. degrees in computer engineering from the Petroleum University of Technology, Tehran, Iran, 2001 and Shiraz University, Shiraz, Iran, 2004, respectively. He received the Ph.D. degree in computational neurosciences from the Institute for Studies in Fundamental Sciences (IPM) in Tehran, 2009. He then spent a year at the University of Bonn as a postdoc. He has been a postdoctoral scholar at iLab, University of Southern California, Los Angeles, since March 2010. His research interests include computer vision, machine learning, and neurosciences with particular emphasis on visual attention, visual search, active learning, scene and object recognition, and biologically plausible vision models.
Andreas Lennartz worked as a visiting master's student at the Department of Computer Science, University of Massachusetts Boston, in 2006–2007. He is currently with the DWH system development team at Air Berlin PLC & Co. KG.
Marc Pomplun is an Associate Professor of Computer Science at the University of Massachusetts Boston, where he joined the faculty in 2003. He received his M.S. in Computer Science from Bielefeld University, Bielefeld, Germany, in 1994, and his Ph.D. in Computer Science under Professor Dr. Helge Ritter at the same university. He was a postdoctoral fellow in the Department of Psychology at the University of Toronto. His research interests include human vision, computer vision, and human–computer interaction.
1. Tel.: +1 617 287 6443.