Abstract
Controlling background data selection in presence-only models is crucial for addressing sampling biases and enhancing model performance. While numerous studies have evaluated the impact of various background data selection techniques across different taxa, research remains limited on how spatially restricted background areas, combined with random or biased distribution of background points, influence model performance for Rattus species predictions. These species often present challenging collection conditions and low trap success rates, potentially leading to spatial biases in the occurrence records that may affect the accuracy of model predictions. This study therefore examined how model accuracy for Rattus species varies when spatial background restrictions are applied within the study area. These restrictions were defined by four main criteria: (1) areas within islands with documented species occurrences, (2) areas within the species’ extent of occurrence according to IUCN range maps, (3) a defined road distance, and (4) buffer areas of varying size around recorded species occurrences. To further assess the effects of spatial background restrictions on model performance, we distributed the background sampling points with two methods: random and biased (using a bias file). Our findings demonstrate that both the choice of spatial background restriction and the distribution method for background sampling points strongly influence model performance and the accuracy of predicted habitat suitability for Rattus species. Defining a specific spatial restriction, such as restricting background selection to within 5 km of a road, improves model performance. However, overly narrow buffers, such as the 20 km buffer used in this study, fail to capture the full environmental variability available to the species, which can diminish model accuracy. Furthermore, the method used to distribute background sampling points, whether random or biased, affects the predictive outcomes. To ensure reliable predictions, we recommend a systematic evaluation of different spatial restriction methods and distribution approaches, along with a thorough analysis of their impacts on model performance. This approach not only reveals how outcomes vary across modeling scenarios but also provides a strong basis for identifying the most reliable predictions. By carefully assessing these factors, researchers can refine habitat suitability models for Rattus species, enhancing predictive accuracy and producing more consistent results.
Over the past several centuries, environmental changes have profoundly affected ecosystems (Sintayehu 2018). Documented observations show that many species are undergoing significant shifts in phenological events, such as changes in the timing of seasonal life cycles (Sintayehu 2018; Doak and Morris 2010; Miller-Rushing et al. 2010; Dawson et al. 2011). At the same time, there are notable alterations in species habitat ranges (Burrows et al. 2011; Doney et al. 2012) and disturbances within food webs (Sintayehu 2018). These ecological disruptions heighten the risk of extinction for numerous species (Manes et al. 2021) and accelerate the spread of pest species (Finch et al. 2021), thereby compounding the challenges to ecosystem stability and resilience. In response, researchers worldwide advocate for developing and implementing comprehensive strategies to mitigate these impacts and preserve biodiversity. In this context, species distribution models (SDMs) have become crucial. These models generate scenarios of potential distributions from known and projected environmental variables. Such outputs enable researchers to make inferences about community assemblages, evolutionary trends, and ecological dynamics, information that is often beyond the reach of local field data alone (Murphy and Smith 2021; McShea 2014). Despite their broad applicability, SDMs face critical limitations (Moudrý et al. 2024), one of which is sampling bias (Kramer-Schadt et al. 2013). This occurs when available data are disproportionately clumped in certain areas or regions (Phillips et al. 2009).
Sampling bias presents a significant challenge, particularly with occurrence records sourced from the Global Biodiversity Information Facility (GBIF), which often derive from museum collections and observations. These records tend to be incomplete and spatially biased (Collen et al. 2008; Yesson et al. 2007), thereby increasing spatial uncertainty (Inman et al. 2021). The most frequently reported biases arise from sample selection or survey bias, which significantly compromise the accuracy of SDM outcomes (Franklin 2010; Elith et al. 2011). Researchers have demonstrated various methods to reduce sampling bias, so that models more closely reflect the true distribution of species rather than the extent of survey effort. The first is environmental filtering, which clusters observations based on environmental conditions so that the covariate space is adequately represented in the species dataset (e.g. Valera et al. 2014; Inman et al. 2021). The second is spatial filtering of occurrence data, where presence records are randomly removed from areas with high sample densities (e.g. Boria et al. 2014; Inman et al. 2021). The third is incorporating covariates related to sampling effort, which allows the model to account for biases in data collection without altering the original dataset (see Renner et al. 2015). Alternatively, sampling bias can be addressed by controlling the selection of background data. Common strategies include the target-group approach, which guides background data selection using occurrence records of similar or related species (e.g. Phillips et al. 2009; El-Gabbas and Dormann 2018; Barber et al. 2021), and background weight correction, which adjusts for known biases in observation data (e.g. Dudik et al. 2005; Warton et al. 2013; Inman et al. 2021). Another option is to restrict background selection (e.g. Schartel and Cao 2024) to a specified fraction of the study area to minimize misrepresentation of the range of conditions accessible to the species (e.g., Fourcade et al. 2014; Jarnevich et al. 2017).
In presence-only models, the selection of background data significantly influences model performance, affecting the accuracy of habitat suitability estimates (Anderson and Raza 2010; Jarnevich et al. 2017; Amaro et al. 2023). Ideally, the extent of background data for species distribution models should be based on the species’ ecological characteristics, specifically their dispersal limitations. However, such data are not available for most species (Anderson and Raza 2010; Barve et al. 2011). Since sampling bias in datasets is often unknown (Fourcade et al. 2014), some studies have used climate classifications correlated with species occurrences to inform background data selection (e.g., Hill and Terblanche 2014; Hill et al. 2017). Beyond climatic conditions, background data selection can also be guided by spatial or geographical data related to species occurrences, such as topography (e.g. Lannuzel et al. 2021), road distance (e.g. Kadmon et al. 2004; Chauvier et al. 2021), species geographic ranges obtained from the IUCN Red List (e.g., Ranc et al. 2017; Whitford et al. 2024), and buffer areas around occurrence records (e.g. VanDerWal et al. 2009; Fourcade et al. 2014; Whitford et al. 2024). Moreover, the placement of background sampling points is crucial because these points provide the contrast against which the observed distribution of species’ presences is evaluated across environmental conditions within the study area (Valavi et al. 2022; Phillips and Dudík 2008). Background sampling points in SDMs are typically distributed uniformly at random to represent the range of environments across the study area. However, this method may not fully address potential sampling biases (Phillips et al. 2006, 2009), which could skew model outcomes. Another method incorporates a bias file such as a density map (Elith et al. 2011; Fourcade et al. 2014), which is considered useful when certain geographical or environmental areas have been sampled more intensively than others. This approach aims to balance the model so that it does not overly focus on well-sampled regions at the expense of less sampled areas (Phillips et al. 2009; Fourcade et al. 2014; Elith and Leathwick 2009; Elith et al. 2011; Ranc et al. 2017; Barber et al. 2021).
To the authors’ knowledge, the impact of background restrictions on model performance, particularly restrictions based on spatial data combined with two specific methods for distributing background sampling points, has not previously been explored for Rattus species. These species often exhibit challenging collection conditions (Brown et al. 1999; Leung et al. 1999; Htwe and Singleton 2014; Stuart et al. 2015) and low trap success rates (e.g. Lorica et al. 2022), attributed to their neophobia, or fear of new objects such as live traps (Barnett 1988). This behavior may lead to spatial biases in occurrence records, which could impact the accuracy of model predictions. Thus, this study represents an initial effort to use spatial data to guide the distribution of background samples, using both random and biased methods, across the study area. This approach aims to mitigate sampling bias and enhance model performance in predicting Rattus species distributions. In this study, “Spatial Background Restriction” refers to the spatial data criteria used for background data selection, which include: (1) areas within islands with documented species occurrences, (2) islands within the species’ extent of occurrence as defined by the IUCN range map, (3) a defined road distance, and (4) buffers of varying size around recorded species occurrences. The main objective of this study was to evaluate how these spatial background restriction strategies affect model performance for Rattus species predictions. Furthermore, we investigated these strategies under two conditions, random and biased distribution of background sampling points, to assess their respective impacts on model performance.
Materials and methods
Study area and occurrence records
Some rodent species in the Philippines pose significant economic and health challenges. Certain Rattus species, for example, negatively impact rice yields (Aplin et al. 2003), act as vectors for human pathogens (e.g., Mendoza and Rivera 2019), and affect populations of endemic rodent species as invasive competitors (Rickart et al. 2011), making their management a national priority. Therefore, generating accurate statistical habitat-based analyses for these species may contribute to optimizing rodent management strategies in the country. However, because geographical records are limited for many rodents in the country, particularly within the genus Rattus, this study focuses on those Rattus species for which there are sufficient occurrence records to construct a model (see Hernandez et al. 2006; Pearson et al. 2007; Proosdij et al. 2016; Sampaio and Cavalcante 2023). Thus, the study includes only three invasive species, Rattus tanezumi (Oriental Asian House Rat), Rattus norvegicus (Brown Rat), and Rattus exulans (Polynesian Rat), and one endemic species, Rattus everetti (Philippine Forest Rat). We collected occurrence data for the four Rattus species from various sources, including the Global Biodiversity Information Facility (GBIF) (https://www.gbif.org), the synopsis of the Philippine Mammals (https://www.fieldmuseum.org), and data records from the University of the Philippines Los Banos-Museum of Natural History (UPLB-MNH). The records acquired from the two latter sources were georeferenced following the recommendation of Wieczorek et al. (2004). To ensure the accuracy of records sourced from GBIF, we used the package “CoordinateCleaner ver 3.0” (Zizka et al. 2019) in R ver. 4.1.2 (R Core Team) to eliminate records associated with country or province centroids, open oceans, and locations of biodiversity institutions such as museums, zoos, and universities, and to remove duplicated records. This resulted in a dataset comprising 201 records for R. everetti, 170 for R. tanezumi, 157 for R. exulans, and 23 for R. norvegicus (see Supplemental Fig. S1).
Environmental data
For our environmental data, we used a 10-m resolution land cover map with ten categories (ESRI 2021) and nineteen climatic variables at 30 arc-seconds (~ 1 km by 1 km spatial resolution, www.worldclim.org) sourced from Fick and Hijmans (2017). We also used ~ 1 km by 1 km resolution data from Venter et al. (2018) that represent cumulative human pressure on the environment. This dataset quantifies the impact of human activities such as settlement, agriculture, and transportation on natural landscapes, which is essential for assessing the effect of habitat degradation on species occurrence. We employed the R package “multiColl ver 2.0” to assess cross-correlation among the thirty environmental predictors and reduce the risk of incorrect inferences and predictions (Salmerón et al. 2021). We applied cut-off thresholds to mitigate the influence of highly correlated variables. Predictors were flagged as collinear when pairwise correlation coefficients |r| exceeded 0.7, Condition Numbers (CN) surpassed 30, or Variance Inflation Factors (VIF) exceeded ten, whereas a determinant of the correlation matrix (D) approaching 1 indicated no collinearity; these criteria guided the selection of robust environmental predictors for Rattus species (Dormann et al. 2012). All these data were processed within GRASS GIS to aggregate them into a ~ 1 km by 1 km spatial resolution dataset (see Supplemental Tables S1 and S2).
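As an illustration of the pairwise-correlation criterion only, the following minimal Python sketch greedily drops predictors whose |r| with an already retained predictor exceeds 0.7. It is not the multiColl workflow used in the study and does not compute CN, D, or VIF; the function names, the greedy keep-first rule, and the toy values are assumptions made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_correlated(predictors, threshold=0.7):
    """Keep predictors in insertion order; skip any predictor whose |r|
    with an already kept predictor exceeds the threshold (assumed rule)."""
    kept = []
    for name, values in predictors.items():
        if all(abs(pearson_r(values, predictors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values sampled at five grid cells (not data from the study).
predictors = {
    "bio1": [20.0, 22.0, 24.0, 26.0, 28.0],
    "bio5": [30.0, 32.5, 34.0, 36.5, 38.0],   # nearly collinear with bio1
    "bio12": [1500.0, 900.0, 2100.0, 1200.0, 1800.0],
}
selected = drop_correlated(predictors)
print(selected)
```

Because the rule is order-dependent, the predictor listed first is the one retained from each correlated pair; in practice that ordering would reflect ecological relevance.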
Sampling bias correction method
The data records for four Rattus species may lack uniformity across their distribution ranges, as some regions were more heavily sampled than others (Supplemental Figs. S1 and S4). This selective sampling can lead to uneven distribution of records across the country. To address these sampling biases, we guided the background data selection by implementing various spatial background restrictions, detailed in the subsequent sections.
Spatial background restrictions
According to Merow et al. (2013), users have considerable flexibility in determining how to distribute background points within the study area. This flexibility is crucial because there are no universal guidelines or solutions for mitigating sampling bias (Fourcade et al. 2014). This is particularly true for data from archival sources, where prior knowledge about the inherent biases in the datasets is frequently lacking. Thus, we adopted four key approaches to restrict background selection across the study area for predicting habitat suitability for Rattus species. First, we restricted the background data selection to islands with documented species occurrence records (Supplemental Fig. S1), excluding islands that lacked such records (Supplemental Fig. S5). Second, we referenced the International Union for Conservation of Nature (IUCN) to delineate the extent of occurrence for each Rattus species (concept adapted from Ranc et al. 2017; Whitford et al. 2024). The range maps provided by the IUCN, which represent the extent of occurrence for Rattus species (see Supplemental Fig. S2), guided our exclusion of islands without the historical presence of these species, ensuring that background data selection was confined to areas within their verified range (Supplemental Fig. S6). The third approach used a road density map of the study area sourced from the National Mapping and Resource Information Authority (NAMRIA) (https://isportal.namria.gov.ph) and focused on distances from roads (Kadmon et al. 2004). We restricted the background data selection to within a 5 km distance from roads, based on an analysis showing that approximately 90% of the records of the four Rattus species fell within this range (Supplemental Fig. S3). This restriction reflects practical limitations in field data collection: the 5 km distance is presumed to represent the maximum distance within which data collectors can feasibly access and place traps, as areas beyond this range are considered less accessible to observers (Supplemental Fig. S7). Lastly, we defined three buffer areas around the occurrence points (e.g., Whitford et al. 2024), with radii of 20 km, 40 km, and 80 km, to restrict the background area selections, based on the premise that these areas correspond to varying levels of surveyor access, which in turn influences data collection for Rattus species. Each buffer area represents a hypothesized degree of accessibility, critical for modeling the potential distribution of data collection efforts across various geographic extents. This approach allows us to systematically assess how differences in access levels impact the performance of our models for Rattus species. The 20 km radius signifies areas that are highly accessible, facilitating frequent and comprehensive data collection (Supplemental Fig. S10). The 40 km radius encompasses moderately accessible areas, often more remote, where data collection can be less consistent (Supplemental Fig. S11). The largest buffer, with a radius of 80 km, represents the outermost practical limit for surveyor operations, where data collection is expected to be the least frequent, largely due to the logistical constraints of covering this extensive range (Supplemental Fig. S12).
Furthermore, we implemented a combined approach by integrating the 5 km road distance with the other three spatial background restrictions: islands with documented occurrences, IUCN ranges, and buffer areas. This integration aims to evaluate whether including road features enhances the effectiveness of the three spatial background restrictions compared to using them independently. First, we combined the restriction to islands with documented species occurrence records with the 5 km road distance (Supplemental Fig. S8). This approach restricts the background area to islands with documented species occurrences while also considering road accessibility within these islands. Second, we combined the restriction based on the IUCN species’ extent of occurrence with the 5 km road distance (Supplemental Fig. S9). This approach was designed to enable the model to incorporate information from both ecologically significant areas for Rattus species and areas that are accessible for data collection by observers, all within the species’ extent of occurrence as determined by the IUCN. Lastly, we integrated the 5 km road distance within buffer areas of varying widths (20 km, 40 km, and 80 km). This combined approach aims to assess how road distance affects the coverage of potential survey areas across a range of buffer distances, and whether it improves model performance for Rattus species (Supplemental Figs. S13 to S15).
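The buffer and road-distance restrictions, and their combination, amount to intersecting masks over candidate background cells. The sketch below illustrates this in plain Python under stated assumptions: cells and occurrences are (longitude, latitude) pairs, distances use the haversine formula, and the distance from each cell to the nearest road is supplied as a precomputed list. The study derived its restrictions from GIS layers; the function names and toy coordinates here are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_buffer(cell, occurrences, radius_km):
    """True if the cell lies within radius_km of at least one occurrence."""
    return any(haversine_km(cell[0], cell[1], o[0], o[1]) <= radius_km for o in occurrences)


def restrict_background(cells, occurrences, buffer_km=None, road_dist_km=None, max_road_km=5.0):
    """Return the cells that satisfy every restriction that is switched on.

    buffer_km    -- buffer radius around occurrences, or None to skip
    road_dist_km -- per-cell distance to the nearest road, or None to skip
    """
    selected = []
    for i, cell in enumerate(cells):
        if buffer_km is not None and not within_buffer(cell, occurrences, buffer_km):
            continue
        if road_dist_km is not None and road_dist_km[i] > max_road_km:
            continue
        selected.append(cell)
    return selected


# Hypothetical inputs: one occurrence and three candidate cells.
occurrences = [(121.0, 14.5)]
cells = [(121.1, 14.5), (121.3, 14.5), (122.0, 14.5)]   # roughly 11, 32 and 108 km away
road_dist_km = [2.0, 8.0, 1.0]

buffer_20 = restrict_background(cells, occurrences, buffer_km=20)
buffer_40_road = restrict_background(cells, occurrences, buffer_km=40, road_dist_km=road_dist_km)
print(buffer_20, buffer_40_road)
```

Island and IUCN-range restrictions would enter the same way, as additional per-cell boolean masks.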
Distribution of sampling background points across study area
In this study, we compared two methods for distributing background sampling points using the flexsdm package version 1.3.4 (see Velazco et al. 2022). The first method, as widely documented (Phillips and Dudík 2008), randomly distributed the background sampling points across the entire study area to ensure a balanced environmental representation (Phillips et al. 2006). The second method used a biased distribution of background sampling points, giving more weight to areas of high sampling intensity (e.g., Ranc et al. 2017) for each Rattus species. We generated species-specific density maps using a Gaussian kernel (Elith and Leathwick 2009; Elith et al. 2011) and used these maps as a bias file to influence the distribution of background points. This method assumes that high-density areas on the maps (refer to Supplemental Fig. S4) indicate regions of heightened historical sampling effort. Background points were therefore sampled more frequently from the high-density parts of these probability surfaces, increasing the likelihood of selecting areas with more extensive sampling histories (Barber et al. 2021). This strategy is designed to enhance the model’s capacity to highlight specific habitats occupied by the species, rather than just areas that are heavily sampled (Ferrier et al. 2002; Dudik et al. 2005; Phillips et al. 2009). Refer to Supplemental Figs. S5 to S16 for a more comprehensive illustration.
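The contrast between the two distribution methods can be sketched in a few lines of Python. This is an illustration only, not the flexsdm implementation: it assumes planar coordinates, an unnormalised Gaussian kernel with a hypothetical `bandwidth`, and weighted sampling without replacement via exponential arrival times (a standard technique that is not described in the source).

```python
import random
from math import exp, log


def kernel_density(cell, occurrences, bandwidth):
    """Unnormalised Gaussian kernel density of occurrences at a cell."""
    return sum(
        exp(-((cell[0] - ox) ** 2 + (cell[1] - oy) ** 2) / (2 * bandwidth ** 2))
        for ox, oy in occurrences
    )


def sample_background(cells, n_points, occurrences=None, bandwidth=1.0, seed=None):
    """Draw n_points distinct background cells.

    Without occurrences, every cell is equally likely (random method).
    With occurrences, cells are drawn with probability proportional to the
    kernel density, which plays the role of the bias file (biased method).
    """
    rng = random.Random(seed)
    if occurrences is None:
        return rng.sample(cells, n_points)
    timed = []
    for cell in cells:
        weight = kernel_density(cell, occurrences, bandwidth)
        # Exponential arrival time with rate = weight; the earliest arrivals
        # form a weighted sample without replacement.
        arrival = -log(1.0 - rng.random()) / weight if weight > 0 else float("inf")
        timed.append((arrival, cell))
    timed.sort()
    return [cell for _, cell in timed[:n_points]]


# Hypothetical 1-D transect of 100 cells with a single occurrence at x = 0.
cells = [(float(x), 0.0) for x in range(100)]
occurrences = [(0.0, 0.0)]
random_points = sample_background(cells, 10, seed=1)
biased_points = sample_background(cells, 10, occurrences=occurrences, bandwidth=5.0, seed=1)
print(sorted(p[0] for p in biased_points))
```

In this toy transect the biased draw concentrates near the occurrence, whereas the random draw is spread over the whole transect.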
Model scenarios
Given the extensive range of our study area, we used 10,000 background sampling points to contrast with the presence records for each of the four Rattus species, following the recommendations of Phillips and Dudík (2008) and Barbet-Massin et al. (2012). This number of points is recognized as sufficient for building predictive models. Using the random and biased distribution methods, we analyzed the spatial background restrictions separately, creating distinct model scenarios (refer to Table 1). Two of the twenty-four model scenarios served as “reference models” that did not incorporate the spatial background restrictions developed for this study (Supplemental Fig. S16). The first reference model employed the random method, distributing background sampling points uniformly across the study area. The second reference model used the biased approach, with a density map as a bias file to distribute background sampling points. These reference models established baselines for assessing how model performance changes when spatial restrictions are applied, compared to scenarios where the background area is not restricted by geographical information.
Table 1
Background sampling point distribution methods and types of spatial background restrictions employed in this study to create the model scenarios

| Scenario (S) | Background sampling points distribution method (random/biased) | Islands with documented records | IUCN range map | 5 km road radius | 20 km buffer width | 40 km buffer width | 80 km buffer width |
|---|---|---|---|---|---|---|---|
| S1 | Random | ○ | − | − | − | − | − |
| S2 | Random | − | ○ | − | − | − | − |
| S3 | Random | − | − | ○ | − | − | − |
| S4 | Random | ○ | − | ○ | − | − | − |
| S5 | Random | − | ○ | ○ | − | − | − |
| S6 | Random | − | − | − | ○ | − | − |
| S7 | Random | − | − | − | − | ○ | − |
| S8 | Random | − | − | − | − | − | ○ |
| S9 | Random | − | − | ○ | ○ | − | − |
| S10 | Random | − | − | ○ | − | ○ | − |
| S11 | Random | − | − | ○ | − | − | ○ |
| S12 (reference model) | Random | − | − | − | − | − | − |
| S13 | Biased | ○ | − | − | − | − | − |
| S14 | Biased | − | ○ | − | − | − | − |
| S15 | Biased | − | − | ○ | − | − | − |
| S16 | Biased | ○ | − | ○ | − | − | − |
| S17 | Biased | − | ○ | ○ | − | − | − |
| S18 | Biased | − | − | − | ○ | − | − |
| S19 | Biased | − | − | − | − | ○ | − |
| S20 | Biased | − | − | − | − | − | ○ |
| S21 | Biased | − | − | ○ | ○ | − | − |
| S22 | Biased | − | − | ○ | − | ○ | − |
| S23 | Biased | − | − | ○ | − | − | ○ |
| S24 (reference model) | Biased | − | − | − | − | − | − |

The six rightmost columns are the spatial background restrictions. ‘○’ indicates that the corresponding spatial background restriction is integrated into the model scenario, whereas ‘−’ denotes that the restriction is not included
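The scenario design in Table 1 is the cross of two distribution methods with twelve restriction sets, which can be encoded compactly. The following Python sketch reproduces the S1 to S24 numbering; the restriction labels and the dictionary layout are hypothetical names chosen for this example.

```python
# Twelve restriction sets in the row order of Table 1; the empty set is the
# unrestricted reference model. Labels are illustrative, not from the study.
RESTRICTION_SETS = [
    {"islands"},
    {"iucn_range"},
    {"road_5km"},
    {"islands", "road_5km"},
    {"iucn_range", "road_5km"},
    {"buffer_20km"},
    {"buffer_40km"},
    {"buffer_80km"},
    {"road_5km", "buffer_20km"},
    {"road_5km", "buffer_40km"},
    {"road_5km", "buffer_80km"},
    set(),
]


def build_scenarios():
    """Return a dict mapping 'S1'..'S24' to method, restrictions and reference flag."""
    scenarios = {}
    number = 1
    for method in ("random", "biased"):
        for restrictions in RESTRICTION_SETS:
            scenarios[f"S{number}"] = {
                "method": method,
                "restrictions": frozenset(restrictions),
                "reference": not restrictions,  # no restriction => reference model
            }
            number += 1
    return scenarios


scenarios = build_scenarios()
for name in ("S3", "S12", "S22"):
    print(name, scenarios[name]["method"], sorted(scenarios[name]["restrictions"]))
```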
Model performance
We used the Maximum Entropy (MaxEnt) modeling approach, specifically the `SDMtune` package ver. 1.3.1 (see Vignali et al. 2020), to build the final models for Rattus species. We used `MIAmaxent ver.1.2` (Vollering et al. 2019) to select important environmental predictors for each Rattus species (see Supplemental Table 3). This package provides tools for user-controlled transformation of explanatory variables, selection of variables by nested model comparison, and flexible model evaluation. This procedure yields simpler yet effective models with reduced overfitting risk, focusing on variables that significantly enhance predictive accuracy for each Rattus species (Vollering et al. 2019). We then computed the variable importance (%) of the final environmental variables for each Rattus species using the `SDMtune` package ver. 1.3.1 (Supplemental Figs. 17 to 20). Additionally, we generated the marginal response curves for each model scenario with respect to these variables (Supplemental Figs. 21 to 28). We also illustrated the response curves under the two methods of distributing background points: the random method and the biased method (Supplemental Figs. 29 to 31).
We partitioned the dataset using a threefold cross-validation approach to optimize it for both training and validation purposes. Additionally, to assess the robustness of our models, we carried out 100 individual runs for each species model. We then averaged the outcomes of the 100 replications to create a single predictive map for each species per model scenario. A single map represents the average predictive metric outcome across all 100 iterations, providing a more reliable result of the model’s performance. We evaluated the model’s accuracy using two key metrics: the Area Under the Curve (AUC) and True Skill Statistics (TSS). A value greater than 0.5 for the AUC metric suggests that our model’s prediction is better than chance (Liu et al. 2018). For the TSS metric, a value below 0.40 indicates poor model performance, while a value above 0.8 is considered excellent (Pramanik et al. 2018).
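For readers unfamiliar with the two metrics, the Python sketch below shows how AUC and TSS can be computed from predicted suitability at presence and background points. It is a minimal illustration, not the SDMtune implementation: AUC is computed as the proportion of presence/background pairs ranked correctly (ties count half), TSS is sensitivity plus specificity minus one at a given threshold, and the threshold value and toy scores are assumptions, since the source does not state how the threshold was chosen.

```python
def auc(presence_scores, background_scores):
    """Probability that a presence point scores higher than a background point."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def tss(presence_scores, background_scores, threshold):
    """True Skill Statistic: sensitivity + specificity - 1 at the threshold."""
    sensitivity = sum(p >= threshold for p in presence_scores) / len(presence_scores)
    specificity = sum(b < threshold for b in background_scores) / len(background_scores)
    return sensitivity + specificity - 1.0


def mean_metric(values):
    """Average a metric over replicate runs (e.g. the 100 runs per scenario)."""
    return sum(values) / len(values)


# Hypothetical predicted suitability values (not results from the study).
presence = [0.9, 0.8, 0.7, 0.4]
background = [0.6, 0.5, 0.3, 0.2, 0.1]
auc_value = auc(presence, background)
tss_value = tss(presence, background, threshold=0.65)
print(auc_value, tss_value)
```

The pairwise AUC loop is quadratic in the number of points, which is acceptable for a sketch but would normally be replaced by a rank-based computation for 10,000 background points.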
Furthermore, we assessed habitat suitability for each species generated from different model scenarios on a scale from 0 to 1.0. Scores between 0 and 0.4 imply very low suitability, scores between 0.6 and 0.8 indicate moderate suitability, and scores above 0.8 up to 1.0 signify very high suitability (Pramanik et al. 2018). Additionally, we employed Schoener’s D to quantify the extent to which model scenarios employing spatial background restrictions produced habitat suitability patterns similar to those of the reference model. Values of Schoener’s D approaching 1 signify high or complete similarity, whereas 0 denotes no similarity (Schoener 1968). To facilitate the interpretation of these similarities, we adapted the categorical ranges suggested by Rödder and Engler (2011) and Santamarina et al. (2023). Specifically, we defined the ranges in this study as follows: 0–0.5 indicates no or very low similarity, 0.5–0.7 low similarity, 0.7–0.8 moderate similarity, 0.8–0.9 high similarity, and 0.9–1.0 very high similarity.
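Schoener’s D between two suitability maps is commonly computed as D = 1 − 0.5 Σ|p₁ᵢ − p₂ᵢ|, where each map is first normalised to sum to one over the grid cells. The Python sketch below applies that formula and maps the result onto the similarity categories listed above; treating each lower bound as inclusive is an assumption, because the stated ranges share their boundary values, and the toy maps are hypothetical.

```python
def schoener_d(suitability_a, suitability_b):
    """Schoener's D between two suitability maps given as equal-length lists."""
    total_a, total_b = sum(suitability_a), sum(suitability_b)
    return 1.0 - 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(suitability_a, suitability_b)
    )


def similarity_class(d):
    """Map D onto the categories used in this study (lower bounds inclusive)."""
    if d < 0.5:
        return "no or very low"
    if d < 0.7:
        return "low"
    if d < 0.8:
        return "moderate"
    if d < 0.9:
        return "high"
    return "very high"


# Hypothetical suitability values for three grid cells.
reference = [0.2, 0.3, 0.5]
scenario = [0.1, 0.4, 0.5]
d_value = schoener_d(reference, scenario)
print(round(d_value, 3), similarity_class(0.95))
```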
Results
Model performance and evaluation across model scenarios
Across the twenty-four model scenarios and four Rattus species, the random background distribution method yielded higher-performing models than the biased background distribution method (Fig. 1). Specifically, the three species with over 100 recorded occurrences each (R. everetti, R. exulans, and R. tanezumi; Fig. 1a to f) showed a consistent pattern of high performance in models S3 (random background points + 5 km road distance), S4 (random background points + 5 km road distance + documented known occurrence), and S5 (random background points + 5 km road distance + IUCN species occurrence range). These models achieved AUC values between 0.85 and 0.88 and TSS values from 0.56 to 0.62, indicating good performance (Fig. 1a to f). For R. norvegicus, which had the fewest records at 23, the highest model performance was observed under models S1 (random background points + documented known occurrence), S2 (random background points + IUCN species occurrence range), and S12 (reference model under the random method). Each of these models achieved AUC values of 0.85 to 0.86 and TSS values from 0.59 to 0.60, indicating good performance (Fig. 1g, h). Across all four species, restricting the spatial background to a 20 km buffer, whether using a random or biased distribution of background points, consistently resulted in lower model performance than the other scenarios.
Fig. 1
Model performance evaluation for R. everetti (a and b), R. tanezumi (c and d), R. exulans (e and f) and R. norvegicus (g and h). For each Rattus species, the left panel shows AUC and the right panel shows TSS. The left y-axis denotes the range of AUC and TSS values, while the right y-axis indicates the interpretation of the metric values. The figures are color-coded, with green denoting the random background distribution method, yellow the biased distribution method, and white the reference models
Model performance and habitat suitability comparison with reference model
Our analysis showed that some model scenarios with spatial background restrictions had high to very high similarity with the reference models S12 (random) and S24 (biased) based on Schoener’s D index (Figs. 2 and 3 and Supplemental Figs. S32 to S34). Within the random background sampling point distribution, models S1, S2, and S8 demonstrated high similarity (D > 0.9) in habitat suitability with the reference model S12 for R. everetti (see Figs. 2 and 3). For R. tanezumi, models S1, S2, and S3 showed high similarity (D > 0.9) with the reference model S12 (Fig. 2 and Supplemental Fig. S32). In the case of R. exulans, models S1, S2, S3, S4, S5, and S8 exhibited high similarity (D > 0.9) with the reference model (Fig. 2 and Supplemental Fig. S33). In terms of model scenarios that surpassed the AUC of the reference model (S12), our analysis showed that for these three Rattus species, restricting the background area to a 5 km road distance alone (S3) or in combination with other spatial background restrictions (S4, S5, S10, and S11) performed better than the reference model (S12) (Fig. 1). This demonstrates that these restrictions can improve model performance and address sampling bias in the available datasets of these three Rattus species, compared to simply distributing the background sampling points randomly across the study area (refer to Fig. 2). For R. norvegicus, which has only 23 occurrence records, all model scenarios except models S6, S7, S9, and S10 demonstrated high to very high similarity in habitat suitability predictions with the reference model (see Fig. 2 and Supplemental Fig. S34). However, none of these model scenarios outperformed the reference model (S12) for this species (see Fig. 2). Meanwhile, model S6, which uses a 20 km buffer width, consistently recorded the lowest AUC values across all four Rattus species, indicating its negative impact on model performance.
Additionally, integrating road features within the 20 km buffer did not notably improve model performance, which was especially evident for R. norvegicus (refer to Fig. 2).
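To make the similarity measure concrete, the following is a minimal sketch of Schoener’s D computed between two habitat suitability surfaces. It assumes each surface is supplied as a flat sequence of non-negative suitability values over the same grid cells; the function name `schoeners_d` and the example values are hypothetical and are not taken from the study’s workflow.

```python
def schoeners_d(suit_a, suit_b):
    """Schoener's D between two suitability surfaces on the same grid cells.

    Each surface is rescaled so its cells sum to 1, then
    D = 1 - 0.5 * sum(|p_a - p_b|), giving 0 (no overlap) to 1 (identical).
    """
    if len(suit_a) != len(suit_b):
        raise ValueError("both surfaces must cover the same cells")
    total_a = sum(suit_a)
    total_b = sum(suit_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each surface needs a positive total suitability")
    # Compare the normalised suitability cell by cell.
    diff = sum(abs(a / total_a - b / total_b) for a, b in zip(suit_a, suit_b))
    return 1.0 - 0.5 * diff


# Hypothetical suitability values for four grid cells.
reference = [0.1, 0.4, 0.8, 0.2]    # e.g. an unrestricted reference scenario
restricted = [0.2, 0.4, 0.6, 0.2]   # e.g. a spatially restricted scenario

print(schoeners_d(reference, reference))
print(schoeners_d(reference, restricted))
```

Because both surfaces are normalised before comparison, D reflects differences in the relative pattern of suitability rather than in its absolute magnitude.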
Fig. 2
Relationship between the model performance (AUC) and similarities in habitat suitability prediction with the reference models (Schoener’s D index) among scenarios, where 0 signifies no resemblance and 1 represents identical predictions with their respective reference models (S12 or S24). Red dots denote scenarios using the random distribution method, while blue dots represent those employing the biased distribution method
Fig. 3
Differences in the habitat suitability predictions with the reference models for R. everetti. This figure compares various model scenarios (S1 to S8 and S10 to S17) alongside the reference models (S9 and S18), highlighting their similarity levels. Maps that appear without clear red and blue coloration represent scenarios where habitat suitability predictions align closely or are identical to those made by the reference models (S9 and S18). The presence of blue and red hues highlights areas of under-predictions (blue) and over-predictions (red)
Employing the biased background sampling point distribution method yielded varied results across different Rattus species. For R. everetti and R. tanezumi, models S13 (restricted to islands with documented occurrences), S14 (restricted to the IUCN species range), and S19 and S20 (40 km and 80 km buffer sizes, respectively) demonstrated a very high degree of similarity (D > 0.95) in habitat suitability to the reference model (S24) (Figs. 2 and 3 and Supplemental Figs. S14 and S15). As for R. exulans, all model scenarios exhibited high similarity (D > 0.9) with the reference model. Notably, spatial background restrictions incorporating a 5 km road distance alone (S15) or with other spatial restrictions (S16, S17, S22, and S23) consistently outperformed the reference model (S24) for these three Rattus species (Fig. 2). In the case of R. norvegicus, all model scenarios showed high degrees of similarity (D > 0.9) with the reference model (S24) (Fig. 2 and Supplemental Fig. S16). However, none of these model scenarios surpassed the AUC of the reference model, although all closely matched its performance for R. norvegicus predictions (Fig. 2). Meanwhile, model scenarios implementing a spatial background restriction of a 20 km buffer size consistently showed the lowest AUC values across all four Rattus species, highlighting its negative impact on model performance even with a biased distribution method (Fig. 2).
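For readers less familiar with how AUC is read in a presence-only setting, the sketch below computes it as the probability that a randomly chosen presence record receives a higher suitability score than a randomly chosen background point, with ties counted as one half. This is the standard rank-based definition, not necessarily the exact routine used by the modeling software in this study; the function name `presence_background_auc` and the scores are hypothetical.

```python
def presence_background_auc(presence_scores, background_scores):
    """Rank-based AUC: share of presence/background pairs ranked correctly.

    A pair counts 1 when the presence score is higher, 0.5 when tied.
    """
    if not presence_scores or not background_scores:
        raise ValueError("need at least one presence and one background score")
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Hypothetical predicted suitability at presence and background locations.
presence = [0.9, 0.7, 0.6]
background = [0.8, 0.3, 0.2, 0.1]

auc = presence_background_auc(presence, background)
print(round(auc, 3))
```

Since the background points enter the calculation directly, changing where they are drawn from (for example, a 20 km buffer versus the full study area) changes the AUC even when the presence records are identical.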
Differences in variable importance and response curves across model scenarios
Our analysis confirmed that the variable importance of key environmental factors, particularly BIO1 (Annual Mean Temperature), remained consistently high across all model scenarios for each of the four Rattus species studied. This uniformity is evident from the highly stable marginal response curves of BIO1, which showed little to no variation across scenarios employing various methods of spatially restricting the background area. This consistent behavior highlights BIO1’s strong predictive power and establishes it as a critical determinant in predictive models for Rattus species. Conversely, other environmental variables displayed considerable variability in their marginal response curves across scenarios. This inconsistency indicates that these less influential variables are more susceptible to changes in model settings and background conditions, which affects their reliability and influence on the predictive outcomes (Supplemental Figs. S21 to S28). This pattern of consistency and variability extends to scenarios employing both random and biased methods for distributing background sampling points (Supplemental Figs. S29 to S32).
Discussion
This study demonstrated that spatially restricting background data selection significantly improved model performance compared to scenarios without such restrictions (reference models S12/S24). Nevertheless, some model scenarios exhibited high similarity with the reference models (refer to Fig. 2). This suggests that if the applied spatial background restrictions encompass a similar range of environments or spatial extent as the reference model, employing these restrictions may change neither how the model handles sampling bias nor its habitat suitability predictions relative to the reference model. However, differences in model performance may still occur. For instance, among the various spatial background restrictions analyzed, employing a 5 km road distance, either independently or in combination with other spatial restrictions, consistently improved the performance of habitat suitability predictions (Stockwell and Peterson 2002), particularly for R. everetti, R. exulans, and R. tanezumi, each with over 100 occurrence records (Fig. 2 and Supplemental Figs. S17 to S19). The influence of roads on model performance varies across studies. Kadmon et al. (2004) note that the impact of road bias on model predictions is influenced by the distribution of environmental features, such as climatic variability, within the geographic extent of the road network. In particular, the extent to which roadside bias impacts model predictions depends on how effectively the road network represents and captures the species’ overall habitat range. In this analysis, the inclusion of a 5 km road distance likely captured diverse climatic and environmental gradients within the study area. When road features were combined with other spatial background restrictions, such as islands with documented occurrences, IUCN range data, and an 80 km buffer (see Figs. 1 and 2), these spatial restrictions performed better in terms of AUC values. This combined spatial representation of environmental conditions likely provided a strong basis for model learning by encompassing a representative cross-section of broader environmental conditions relevant to the species. This enabled the model to more accurately reflect the ecological preferences of Rattus species, thereby enhancing its performance.
However, if the road network is confined to areas with a narrow climatic profile, it can amplify bias and lead to skewed predictions (Franklin 2010). Such bias may cause overestimations or underestimations of environmental suitability (Phillips et al. 2009; Stolar and Nielsen 2015; Yackulic et al. 2013) because the spatial coverage and environmental diversity of the background area may significantly influence model outcomes. In this study, incorporating a 5 km road distance within a 20 km buffer area (models S9 and S21) showed some improvement in the model performance, though the improvement was not substantial. This limited improvement is likely because the combined spatial background restriction (20 km buffer + 5 km road distance) captured only a subset of the environmental gradients associated with road features, failing to represent the broader range of conditions across the study area (see Figs. 1 and 2). By contrast, using a 20 km buffer area without road features (models S6 and S18) revealed more significant challenges that affected model prediction accuracy. The model tended to become overly focused on areas with dense records, performing well on the training data but poorly on new or unseen data, a phenomenon known as overfitting. At the same time, it failed to capture key environmental conditions and the broader ecological context of Rattus species outside the narrow buffer, resulting in underfitting (Radosavljević and Anderson 2014). These spatial background restrictions reduced the model’s ability to generalize effectively across the larger study area (Elith et al. 2006; Lobo et al. 2008; Chefaoui and Lobo 2008; Anderson and Gonzalez 2011; Radosavljević and Anderson 2014; Barve et al. 2011; Amaro et al. 2023). The analysis from this study also demonstrated that the response curves for each environmental variable, presented in Supplemental Figs. 21 to 31, are influenced by the spatial background restrictions applied to the environmental space. 
Restricting the background area, for example, to islands with documented species occurrences, proximity to roads, the International Union for Conservation of Nature (IUCN) extent of occurrence, or buffers around known occurrences leads the model to recalibrate the relationship it estimates between species presence and environmental factors within the newly constrained space. This recalibration often results in altered response curves, particularly for less informative variables, as their contributions to the model are more sensitive to changes in the background area. In contrast, variables with substantial predictive value tend to display more consistent response curves, reflecting their stability against variations in the background constraints.
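The buffer and road-distance restrictions discussed above can be illustrated with a small sketch. It assumes candidate background cells, occurrence records, and road vertices are all given as planar (x, y) coordinates in kilometres, and it approximates the road network by its vertices rather than by line geometry. The function names, coordinates, and grid are hypothetical; a real analysis would typically use GIS tooling on projected layers.

```python
import math


def within_distance(cell, points, max_km):
    """True if the cell lies within max_km of at least one point."""
    return any(math.dist(cell, p) <= max_km for p in points)


def restrict_background(cells, occurrences, buffer_km,
                        road_points=None, road_km=None):
    """Keep cells inside the occurrence buffer and, optionally, near roads."""
    kept = [c for c in cells if within_distance(c, occurrences, buffer_km)]
    if road_points is not None and road_km is not None:
        kept = [c for c in kept if within_distance(c, road_points, road_km)]
    return kept


# Hypothetical 100 km x 100 km study area sampled every 10 km.
cells = [(x, y) for x in range(0, 100, 10) for y in range(0, 100, 10)]
occurrences = [(20.0, 20.0), (70.0, 60.0)]
road_points = [(x, 20.0) for x in range(0, 100, 5)]  # one east-west road

buffer_only = restrict_background(cells, occurrences, buffer_km=20)
buffer_and_road = restrict_background(cells, occurrences, 20,
                                      road_points, road_km=5)

print(len(cells), len(buffer_only), len(buffer_and_road))
```

In this toy layout, combining the two restrictions leaves only cells along the single road, which mirrors the point made above: a narrow buffer intersected with a road corridor can retain only a small subset of the environmental gradients present in the study area.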
Correspondingly, Gaul et al. (2020) highlighted that although spatial bias can decrease model performance, sample size is ultimately more crucial. Small sample sizes become more problematic when the few available occurrence records are highly concentrated in localized areas with distinct environmental conditions, potentially skewing model results significantly (Boria et al. 2014; Baker et al. 2022; Rocchini et al. 2023). In this study, the effect is particularly pronounced for R. norvegicus (see Fig. 2 and Supplemental Fig. S16). The occurrence records for this species were predominantly found in urban areas within the study region (see Supplemental Figs. S1 and S4). This clustering, alongside a limited number of records (N = 23), may amplify bias, affecting the accuracy of model predictions and performance. Although spatial background restrictions were employed, no significant differences in model performance were observed across the scenarios (see Fig. 2). This may suggest that these restrictions do not adequately mitigate this bias (Supplemental Fig. S16), possibly because the model remains sensitive to the specific conditions present in these limited data points. With few occurrence records, effectively validating the model across diverse environmental settings becomes challenging and may result in models that develop skewed or overly narrow ecological niche predictions due to the concentrated nature of the data (Supplemental Fig. S20).
This study also showed that the method of distributing background sampling points can significantly affect model performance. Our analysis showed that randomly distributing background points consistently yields higher model performance (AUC) than using a biased distribution method (see Fig. 2). Randomly distributing background points across the study area is a standard method that effectively captures a wide range of environmental conditions. This approach enhances the model’s ability to differentiate between suitable and unsuitable habitats across various environmental scenarios (e.g., Phillips et al. 2009; Re et al. 2023), which may help improve model performance. However, while simply distributing background sampling points randomly across the study area can enhance model performance, it can sometimes misrepresent the range of conditions realistically accessible to the species, potentially affecting the model’s predictions (e.g., Phillips et al. 2009; Wisz et al. 2008; Jarnevich et al. 2017). To address this drawback, several studies recommend that the extent of background data selection should be defined either geographically or environmentally to further improve model accuracy (e.g., Mateo et al. 2010; Bedia et al. 2011; Barbet-Massin et al. 2012; Lyu et al. 2022; Schartel and Cao 2024). This approach enhances the model’s ability to accurately represent the species’ potential habitats, improving overall prediction accuracy and performance, as also shown in the findings of this study (see Fig. 2).
On the other hand, while some studies suggest that using background data from areas with known biases, such as high sampling intensity, can enhance model accuracy (e.g., Ponder et al. 2001; Ferrier et al. 2002; Phillips et al. 2009; Tong et al. 2023), our study observed that the impact of employing a biased method for distributing background sampling points varies with the ecological niche preferences of the Rattus species analyzed (see Fig. 2). For generalist species such as R. tanezumi, R. exulans, and R. norvegicus, concentrating background points solely in high-density areas can negatively impact model performance. This effect likely arises because such concentration biases the model towards the environmental conditions of these areas, which may not fully represent the broader ecological range suitable for these species. Consequently, this may limit the model’s ability to generalize to other areas where these species might be present but are less frequently sampled (Lütolf et al. 2006; Botella et al. 2020). By contrast, for R. everetti, a species restricted to forest areas, the biased distribution method seems to be a more beneficial approach. This species’ presence and survival are closely linked to forest environments (Heaney et al. 2010), suggesting that a targeted approach in these areas could benefit model reliability. As demonstrated in Fig. 2, the model performance (AUC) for R. everetti shows minimal variation between random and biased distribution methods, indicating that both approaches perform similarly for this species. However, because R. everetti occupies a specialized habitat, using a biased method that reflects the geographical biases present in the occurrence data can be particularly advantageous. This approach helps the model represent the environmental conditions essential to the species, rather than only those of the most frequently studied areas.
Therefore, if the occurrence data are geographically biased and align with known ecological information about where the species (i.e., a rare or specialist species) is consistently found, employing a biased method is likely more helpful than a random one. Nevertheless, this decision must be carefully considered, taking into account various factors beyond the placement or distribution of background sampling points to effectively meet the specific modeling challenges and objectives.
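The difference between the random and biased distribution methods can be sketched as follows. The example assumes the biased method draws background cells with probability proportional to a per-cell sampling-effort weight, in the spirit of a bias file; the function `sample_background`, the cell identifiers, and the weights are hypothetical. For simplicity, the biased draw here is made with replacement, whereas the random draw is made without replacement.

```python
import random


def sample_background(cells, n_points, bias_weights=None, seed=0):
    """Draw background cells either uniformly or weighted by sampling effort."""
    rng = random.Random(seed)
    if bias_weights is None:
        # Random method: every cell is equally likely, no repeats.
        return rng.sample(cells, n_points)
    if len(bias_weights) != len(cells):
        raise ValueError("one weight is required per cell")
    # Biased method: cells with more sampling effort are drawn more often.
    # Simplification: this draw is with replacement.
    return rng.choices(cells, weights=bias_weights, k=n_points)


# Hypothetical study area of 100 cells; the first 20 are heavily surveyed.
cells = list(range(100))
effort = [5.0 if c < 20 else 1.0 for c in cells]

random_points = sample_background(cells, 20)
biased_points = sample_background(cells, 20, bias_weights=effort)

print(sorted(random_points))
print(sorted(biased_points))
```

Under the biased draw, background points concentrate where survey effort was high, so the background carries roughly the same spatial bias as the occurrence records. Whether that helps depends, as discussed above, on whether those heavily surveyed areas represent the conditions the species actually uses.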
Conclusion and usage notes
This study has a key limitation: our analysis focused exclusively on geographical information. Despite this, we have still demonstrated the importance of applying spatial restrictions to background data in presence-only models, especially for Rattus species. The study shows that specific methods can significantly enhance model performance, providing an initial strategy to mitigate sampling biases for Rattus species. For instance, spatially restricting background selection to within a 5 km road distance enhanced model performance for predicting Rattus species. However, the choice of buffer width is critical; a spatial extent that is too narrow, such as the 20 km buffer width used in this study, failed to capture the species’ broad ecological variability, potentially negatively impacting model performance. The selection of background distribution methods for species distribution modeling, whether random or biased, should be informed by their effectiveness in accurately representing the environmental conditions crucial for the species under study. Each method has unique advantages that should align with the spatial distribution of the available data. A random distribution method is often preferred because it captures a broad ecological spectrum and prevents bias towards areas with dense data accumulation. This approach ensures that the model comprehensively represents diverse environmental conditions, essential for accurately generalizing species distribution across varied habitats. Conversely, a biased distribution method is more likely to be advantageous in regions where data are densely clustered. By focusing the background data in these areas, it corrects for potential over-representation and achieves a balanced environmental portrayal across the model’s entire range. This strategy is crucial for maintaining the accuracy of the model, especially in studies where specific locations are disproportionately sampled.
Ultimately, the choice of distribution method must be tailored to the specific needs of the study, ensuring that the model reflects the true environmental conditions necessary for reliable and accurate predictions. Furthermore, an important observation from this study is that while mitigating sampling bias is crucial, the quantity of species occurrence records is even more critical. This is particularly significant when the background area selection is broad, but the species records are sparse and concentrated in only a few areas. This scenario highlights the need for a substantial number of diverse data points to effectively model species distributions, especially in extensive study areas.
Overall, our efforts have demonstrated how model performance responds to various spatial background restrictions and distribution methods (random and biased) in predicting habitat suitability for Rattus species. Modelers have the discretion to choose methods and flexibly define the extent of background selection (Merow et al. 2013; Whitford et al. 2024). This discretion carries the responsibility of ensuring that the predictive outcomes are theoretically sound, ecologically meaningful, and statistically rigorous. Fourcade et al. (2014) noted that there are no universal guidelines for mitigating sampling bias and enhancing model performance, often because datasets lack clear indications of inherent biases. Thus, we suggest that investigating different methods and analyzing their impact on model performance is good practice. This approach reveals various outcomes under different scenarios and is crucial for making informed decisions about which model predictions are most reliable.
Acknowledgements
The authors would like to express their appreciation to the curator of small mammals and other wildlife, Associate Professor Phillip A. Alviola of the University of the Philippines Los Baños Museum of Natural History, for sharing his pearls of wisdom with the authors during the course of this research. The support of the Ministry of Education, Culture, Sports, Science and Technology of Japan is gratefully acknowledged for providing academic financial assistance to the corresponding author. However, the funder had no involvement in the research process or the conclusions drawn. This work is independent of the funding source.
Declarations
Conflict of interest
The authors declare that they have no competing interests.
Human and animal rights and informed consent
None.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.