A number of studies investigate how natural image statistics influence the locations of human fixations [28–30]. Despite the variety of low-level image features considered, most researchers agree that the contrast distribution plays a significant role in the guidance of eye movements. Usually, local contrast is defined as the standard deviation of the image intensities within some small region divided by the mean intensity of that region, i.e., the local root mean square (RMS) contrast. However, as the distribution of natural images is non-Gaussian [25, 37], in this paper we follow [14, 17] and model image contrast with the Weibull distribution. Figure 2 illustrates that the two-parameter Weibull distribution fits the local contrast statistics well. Baddeley and Tatler [1] argue that high-frequency edges have the most impact on fixation prediction, with contrast being highly correlated with them; the next most important feature in their analysis is low-frequency edges. Geusebroek and Smeulders [14] show that both contrast and edge frequency are simultaneously captured by the Weibull distribution, which allows these two image regularities to be combined in an elegant way that takes the strong correlation between them into account. In our analysis, we investigate the joint distribution of local contrast and edge frequency and thereby combine the low-level image features known to be most powerful for fixation prediction. Moreover, we do not separate high- and low-frequency edges. Instead, we use the minimal reliable scale selection principle [30], as discussed in Sect. 2.1.2, and implicitly consider edges over the entire available frequency range. Inspired by the centre-surround receptive field design of neurons in the retina [
18], several successful saliency models are based on comparing centre and surround regions at each image location [4, 5, 10, 15, 19, 23, 36]. Intuitively, image locations that deviate from their surroundings should be salient. Itti et al. [19] consider visual features salient if they differ in brightness, colour, or orientation from the surrounding features. Overall, their model combines a total of 42 feature maps into a single saliency map. In contrast, we make no assumptions about patterns in the spatial structure of feature responses and base our model on a comparison of local image statistics with statistics learned from fixation and non-fixation regions. Table 1 and Fig. 5 show that the proposed Weibull method outperforms the method by Itti et al. This may indicate the advantage of training the model parameters directly on an eye movement data set. The higher performance of our method might also be due to the explicit modelling of the correlation between image features. Bruce and Tsotsos [
4] follow an information-theoretic approach and use information maximization sampling to discriminate centre-surround regions. They calculate Shannon's self-information based on the likelihood of the local image content in the centre region given the image content of the surround. Regions whose content is unexpected in comparison with their surroundings are more informative, and thus salient. As shown in Table 1 and Fig. 5, our model achieves performance comparable with the elaborate approach of Bruce and Tsotsos, while using as few as two parameters learned from a set of images with associated eye movements. We have explored the generalization of the proposed method by considering two eye movement data sets: a standard data set from [4] with urban images, and an artistic photo collection with diverse content from National Geographic wallpapers. Examples of images from both data sets are shown in Fig. 3. Table 2 and Fig. 6 show that training the parameters of our Weibull method on the National Geographic data set and testing it on the Bruce&Tsotsos data gives the same results as both training and testing on the Bruce&Tsotsos data. However, for the National Geographic data set, there is a small drop in performance when the parameters of the Weibull model are trained on the Bruce&Tsotsos data instead of the National Geographic data. We attribute this to the higher variation in image content in the National Geographic data. We conclude that the proposed model generalizes well when the variation in the training data is sufficiently diverse.
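The local RMS contrast and its Weibull fit described at the start of this section can be sketched as follows. This is a minimal illustration, not the exact pipeline of the paper: the patch size, the synthetic test image, and the use of `scipy.stats.weibull_min` for the two-parameter fit are our assumptions.

```python
import numpy as np
from scipy.stats import weibull_min

def local_rms_contrast(image, patch=16):
    """Local RMS contrast: the standard deviation of intensities within a
    small region, divided by the mean intensity of that region."""
    h, w = image.shape
    vals = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p = image[y:y + patch, x:x + patch]
            m = p.mean()
            if m > 0:  # skip fully dark patches where contrast is undefined
                vals.append(p.std() / m)
    return np.array(vals)

rng = np.random.default_rng(0)
img = rng.random((128, 128))  # stand-in for a natural image
contrasts = local_rms_contrast(img)

# Two-parameter Weibull: fix the location at 0, estimate shape k and scale lam.
k, loc, lam = weibull_min.fit(contrasts, floc=0)
```

On real natural images the histogram of `contrasts` would be compared against the fitted density, as in Fig. 2.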
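The self-information principle attributed above to Bruce and Tsotsos can be illustrated with a deliberately simplified sketch: score each patch by the negative log-probability of its content under the statistics of all patches in the image, so rare (unexpected) content scores as salient. The patch descriptor (mean intensity), histogram estimator, and smoothing are our simplifications, not the actual AIM model.

```python
import numpy as np

def self_information_saliency(image, patch=8, bins=16):
    """Toy self-information saliency: -log p(patch mean intensity), where p is
    estimated from the histogram of all patch means ('surround' statistics).
    A hypothetical illustration, not Bruce and Tsotsos's implementation."""
    h, w = image.shape
    means, coords = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            means.append(image[y:y + patch, x:x + patch].mean())
            coords.append((y, x))
    means = np.array(means)
    hist, edges = np.histogram(means, bins=bins)
    probs = (hist + 1) / (hist.sum() + bins)   # Laplace-smoothed bin frequencies
    idx = np.digitize(means, edges[1:-1])      # bin index of each patch, 0..bins-1
    return {c: -np.log(probs[i]) for c, i in zip(coords, idx)}

# A single bright patch on a dark background should be the most 'informative'.
img = np.zeros((32, 32))
img[8:16, 8:16] = 1.0
sal = self_information_saliency(img, patch=8, bins=4)
most_salient = max(sal, key=sal.get)
```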