Introduction

Recent analysis of earth observation satellite data suggested that approximately 230 million ha of forest were lost due to disturbance globally in the period between 2000 and 2012 (c. 19 million ha per year), with the greatest loss occurring in the tropics (Hansen et al. 2013). Rates of deforestation in South-east Asia are high and accelerating, with Indonesia showing the largest recent increase in the rate of forest loss globally, while Malaysia has the highest level of forest loss in relation to land area (Hansen et al. 2013). Harris et al. (2012) estimate that 32% of carbon emissions from land-use change in the tropics arose in South and South-east Asia, and Indonesia was the second highest emitter, accounting for 13% of total CO2 emissions from deforestation.

In light of these impacts and international commitments to reduce carbon emissions from deforestation and degradation (REDD+) in developing countries it is important to understand the dynamics of land-use change and predict which areas are at highest risk of forest loss. Multiple studies have attempted to quantify and predict future deforestation, most notably in the Amazon basin (Soares-Filho et al. 2006; Rosa et al. 2013), and in other forests around the world (Rideout et al. 2013; Vieilledent et al. 2013); however, to date we know of no analyses that predict future forest loss across the full extent of Borneo.

Published analyses of forest loss have employed a range of statistical techniques including logistic regression (Chowdhury 2006; Echeverria et al. 2008), genetic algorithms (Venema et al. 2005; Soares-Filho et al. 2013), weights of evidence (Soares-Filho et al. 2010; Maeda et al. 2011) and cellular automata (Thapa et al. 2013). These models are based on a number of environmental and anthropogenic landscape features that are thought to drive deforestation rates. Commonly, these include distance to important features, as well as topography. Pontius et al. (2008) argue that a rigorous comparison of modelling approaches requires the methods to be applied to the same study landscape, but very few published studies have done this.

Logistic regression is almost certainly the modelling approach that has been most commonly used to predict forest loss (e.g., Chowdhury 2006; Echeverria et al. 2008), and has been the dominant method in multi-scale landscape modelling (McGarigal et al. 2016). Random forest (RF; Breiman 2001) is increasingly used in a range of applications including digital soil mapping (Grimm et al. 2008), forest biomass mapping (Baccini et al. 2012), species distribution modeling (Evans and Cushman 2009) and others given its often superior performance compared to other methods (Evans et al. 2011). RF is also gaining prominence in land-use classification (e.g., Aide et al. 2013; Grinand et al. 2013), where it outperforms classification and regression trees (CART; Rodriguez-Galiano et al. 2012) and maximum likelihood classifiers (Schneider 2012). Nonparametric procedures like RF are particularly effective at identifying complex multivariate associations, such as those that affect patterns of forest loss. Few analyses of forest loss have employed random forest (but see Aide et al. 2013; Grinand et al. 2013), and none to our knowledge quantitatively compare the performance of random forest to other commonly used modelling approaches, such as logistic regression, or utilize multi-scale optimization (sensu McGarigal et al. 2016).

Past applications of random forest have used several approaches for selecting the number of occurrence and non-occurrence cells used to train the model. For example, Chawla et al. (2003) and Chen et al. (2004) found that imbalance between the proportion of presence and absence classes can cause bias in the prediction and model fit. They found that when an imbalanced sample is present, the bootstrap of the data is biased towards the majority class, thus over-predicting the majority class and under-predicting the minority class. The resulting model fit can be deceptive, exhibiting very small overall error because errors in the majority class are very small, even though cross-classification error in the minority class is extremely high. An alternative solution that is often used when there are many more absences than presences in a classification dataset is to shift the cutoff for the probability of presence from the true ratio to 0.5 or something smaller (Evans and Cushman 2009). Given that unbiased predictions are essential for predictive models of forest loss to be reliable, it is critical to evaluate the effect of sample ratio and find the ratio that produces unbiased estimates of forest loss risk.

The idea that ecological patterns and processes interact across scales in space and time is a fundamental tenet in landscape ecology (Wiens 1989; Levin 1992). Despite this widespread recognition, a review of the current literature found that most habitat ecology papers fail to address multiple spatial pattern-process relationships, and less than 5% have optimized multi-scale relationships (McGarigal et al. 2016). While many recent studies of land-use change allude to spatial scales, they typically do so implicitly by including distance to various landscape features (roads, rivers, population centres, etc.) as a metric (e.g., Rideout et al. 2013; Rosa et al. 2013; Vieilledent et al. 2013), which is not a true multi-scale analysis (sensu McGarigal et al. 2016). To our knowledge there have been no multi-scale optimization efforts applied to landscape change modeling. Additionally, while several analyses consider some element of landscape composition (e.g., Rosa et al. 2013; Vieilledent et al. 2013; Rideout et al. 2013), we are aware of no studies that have considered both the composition and configuration of the multi-scale neighbourhood around a point as a predictor variable.

There are six main goals of this study. First, we evaluated the effects of varying the ratio of loss to persistence cells in the training data set on the predicted probability of forest loss in both logistic regression and random forest. Second, as suggested by Pontius et al. (2008), we sought to compare the performance of three modelling methods (random forest, logistic regression and a naïve model) by applying them in the same landscape in the same time interval. Third, we sought to formally evaluate the utility of using landscape structure as predictors by computing models with and without landscape metrics. Fifth, we sought to identify variables and their operative scales that best predicted forest loss or forest persistence between 2000 and 2010 and compare them between the three nations that comprise Borneo (Kalimantan Indonesia, Malaysian Borneo and Brunei). Sixth, we sought to apply the best multi-scale models for each of these nations to predict the future risk of forest loss across Borneo in the 2010–2020 time period. We had several hypotheses. First, consistent with Chawla et al. (2003) and Chen et al. (2004), we expected that the probability map produced by random forest would under-predict loss when the ratio of loss in the training sample was equal to that in the real landscape, in which persisting forest cells far outnumber forest loss cells. Second, we expected that random forest would outperform logistic regression and the naïve model in each nation. Third, we expected that models including landscape variables would have substantially higher performance than models that excluded them. Fourth, we predicted that risk of forest loss would be related to predictor variables similarly across the three nations. Fifth, we expected that forest loss risk would be related to (a) topography, with high risk at low elevations and flat areas, and low risk at higher elevations and steep terrain, (b) past forest loss in the surrounding region, (c) protected area status, and (d) human population parameters such as distance to large settlements and local population density, in that order of influence.

Methods

Study area

The analysis extent is the entire island of Borneo, with separate models constructed for Brunei, the Indonesian provinces of Kalimantan, and for Malaysian Borneo (combining the provinces of Sabah and Sarawak). The extent is 731,058 km².

Land cover change data for 2000–2010

We used land cover maps produced for the years 2000 and 2010 by the Centre for Remote Imaging, Sensing and Processing (CRISP; Miettinen et al. 2011, 2012). These maps were based on MODIS surface reflectance product images (NASA 2010), Shuttle Radar Topography Mission (SRTM) 90 m version 4 digital elevation information (Jarvis et al. 2006) and several peatland distribution maps (Wahyunto and Subagjo 2003, 2004; Selvaradjou et al. 2005; Wahyunto et al. 2006). Miettinen et al. (2011) used a three-step classification process to classify the landscape into 13 categories: water, mangrove, peat swamp forest, lowland forest, lower montane forest, upper montane forest, plantation/regrowth, lowland mosaic, montane mosaic, lowland open, montane open, urban and large scale palm plantation. Accuracy assessment for these maps found overall accuracies of 83% for 2000 and 85% for 2010, with class accuracies between 75 and 85% for the non-forest classes and up to 97% for the forest classes (Miettinen et al. 2011, 2012). Overall, the accuracy assessment of these maps compares favourably to other studies (e.g., Clark et al. 2010: 79.3%; Friedl et al. 2010: 75%; Rozenstein and Karnieli 2011: 81%; Rodriguez-Galiano et al. 2012: 92%). Our analysis is based on combining all forest classes and non-forest classes into a binary forest/non-forest map. Peat swamp forest, lowland forest, lower montane forest and upper montane forest were reclassified to forest, while all other areas were considered to be non-forest for our analysis. Category aggregation (Aldwaik et al. 2015) for the 2010 land cover map presented in Miettinen et al. (2012) resulted in a classification where forest has a producer’s accuracy of 93.2% and a user’s accuracy of 91.9%, while non-forest has a producer’s accuracy of 93.9% and a user’s accuracy of 94.9%.
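The category aggregation and the producer's and user's accuracies described above can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical and are not the CRISP accuracy assessment data; the function names are ours, and the convention that rows are mapped classes and columns are reference classes is an assumption.

```python
def aggregate_to_binary(confusion, forest_classes):
    """Collapse a multi-class confusion matrix into forest / non-forest.

    confusion[i][j] = samples mapped as class i whose reference class is j.
    Index 0 of the result is forest, index 1 is non-forest.
    """
    binary = [[0, 0], [0, 0]]
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            r = 0 if i in forest_classes else 1
            c = 0 if j in forest_classes else 1
            binary[r][c] += count
    return binary


def class_accuracies(confusion):
    """Return user's and producer's accuracy for every class."""
    n = len(confusion)
    result = {}
    for k in range(n):
        correct = confusion[k][k]
        mapped_total = sum(confusion[k])                          # row sum
        reference_total = sum(confusion[i][k] for i in range(n))  # column sum
        result[k] = {
            "users": correct / mapped_total,
            "producers": correct / reference_total,
        }
    return result


# Hypothetical three-class example: 0 = lowland forest, 1 = montane forest, 2 = open.
multi = [[40, 5, 5],
         [3, 42, 5],
         [2, 3, 95]]
binary = aggregate_to_binary(multi, forest_classes={0, 1})
accuracy = class_accuracies(binary)
```

Aggregation moves confusion between forest types onto the diagonal of the binary matrix, which is why binary accuracies are typically higher than the per-class values.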

Response variable for analysis

The response variable in our analysis was a binary GIS layer depicting areas of the study area that were forest in 2000 and became non-forest or regenerating forest in 2010 (value of 1 or “forest loss”), and areas that were forest in 2000 and remained forest in 2010 (value of 0 or “persistence”). We produced maps of all pixels that were loss or persistence in the 2000–2010 time period and used these as the source of the loss and persistence cells for model training and model assessment. Cells that were non-forest in 2000 were excluded from model training and assessment.
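As an illustration of how the response variable follows from the two land cover maps, the sketch below codes a single cell from its 2000 and 2010 classes. The function name and the use of text labels rather than raster codes are assumptions made for readability.

```python
# The four classes that the text reclassifies to "forest".
FOREST_CLASSES = {
    "peat swamp forest",
    "lowland forest",
    "lower montane forest",
    "upper montane forest",
}


def response_value(class_2000, class_2010):
    """Return 1 for forest loss, 0 for persistence, None if excluded.

    Cells that were not forest in 2000 take no part in training or
    assessment, so they are returned as None.
    """
    if class_2000 not in FOREST_CLASSES:
        return None
    return 0 if class_2010 in FOREST_CLASSES else 1


loss = response_value("lowland forest", "plantation/regrowth")
persistence = response_value("peat swamp forest", "peat swamp forest")
excluded = response_value("urban", "urban")
```

A change between two forest classes counts as persistence under this binary scheme.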

Predictor variables for analysis

A priori, we proposed several environmental and anthropogenic variables as predictors of recent forest loss (Table 1). Anthropogenic variables included: distance to large population centres, defined as areas with greater than 100 people/km²; regional population density, defined as the focal mean of population density within a 100 km radius; local population density, defined as the point population density at the pixel scale of 250 m; and a map of designated protected areas.

Table 1 Predictor variables used in the analysis

Topographical variables included elevation from the ASTER DEM and several terrain complexity measures produced using the Geomorphometry and Gradient Metrics Toolbox (ArcGIS 10.0; Evans and Oakleaf 2012). These included: topographical roughness, which measures the topographical complexity of the landscape within a defined focal extent (Blaszczynski 1997; Riley et al. 1999), and relative slope position, which measures the relative position of the focal pixel within a defined extent on a gradient from valley bottom to ridge top (Evans et al. 2014). Given that topographical factors may be related to deforestation at a range of spatial scales (Wiens 1989), we calculated topographical roughness and relative slope position at six spatial extents including focal radii of 1, 10, 20, 30, 40 and 50 km. These were chosen because 1 km represents a single pixel and its very near neighbours, while the other scales span a range from local to regional neighbourhoods.

We also included FRAGSTATS metrics quantifying the extent and configuration of different land cover classes across a range of focal extents as predictor variables. The classes used in the analysis include: (1) water, (2) mangrove, (3) peat swamp forest, (4) lowland forest, (5) lower montane forest, (6) upper montane forest, (7) plantation or regrowth, (8) lowland mosaic, (9) montane mosaic, (10) lowland open, (11) montane open, (12) urban (Miettinen et al. 2011, 2012). For each of these classes we used FRAGSTATS 4.0 (McGarigal et al. 2012) to calculate the percentage of the focal landscape covered by each class within the six focal window extents (1, 10, 20, 30, 40 and 50 km). In addition to class-level percentage of the landscape in each cover class, we calculated four landscape-level metrics quantifying the structure of the full multi-class mosaic to assess how landscape complexity was related to deforestation risk. These landscape metrics were edge density (Edge[x]), patch density (Patch[x]), aggregation index (Aggregation[x]; which measures the compaction or aggregation of cover types), and Shannon’s diversity index (Shannon[x]; which measures the diversity of the landscape mosaic within a window in terms of the richness and evenness of the cover types present). Each of these four metrics was also calculated for focal windows at the six spatial extents. FRAGSTATS metrics were calculated for the land-use maps in both the year 2000 and 2010 (Miettinen et al. 2011, 2012); the data from the year 2000 were used to calibrate the models, while the 2010 data were used to predict the risk of future forest loss. All spatial layers were resampled to a 500 m pixel size and projected to an Albers conformal conic projection for analysis.
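To make the focal-window metrics concrete, the sketch below computes two of them, percentage of the landscape per class and Shannon's diversity index, for a window around one cell. It is not FRAGSTATS code: the circular window, the radius expressed in cells, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter


def window_cells(grid, row, col, radius):
    """Collect class labels within `radius` cells of (row, col)."""
    cells = []
    for r in range(max(0, row - radius), min(len(grid), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(grid[0]), col + radius + 1)):
            if (r - row) ** 2 + (c - col) ** 2 <= radius ** 2:
                cells.append(grid[r][c])
    return cells


def class_percentages(cells):
    """Percentage of the window occupied by each class."""
    total = len(cells)
    return {label: 100.0 * n / total for label, n in Counter(cells).items()}


def shannon_diversity(cells):
    """Shannon's diversity index: -sum(p_i * ln p_i) over classes present."""
    total = len(cells)
    return -sum((n / total) * math.log(n / total) for n in Counter(cells).values())


# Hypothetical 3 x 3 landscape: F = forest, P = plantation, W = water.
grid = [["F", "F", "P"],
        ["F", "F", "P"],
        ["W", "F", "P"]]
cells = window_cells(grid, row=1, col=1, radius=1)
percentages = class_percentages(cells)
diversity = shannon_diversity(cells)
```

Repeating the calculation at each of the six radii gives one candidate predictor per metric and scale for every cell.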

Modelling approaches

We used random forest machine learning and logistic regression to predict forest loss risk in the 2000–2010 time period from landscape conditions in 2000, separately for each nation within the study area (Brunei, Malaysia, Indonesia), to assess how the drivers of deforestation may differ between nations. Random forest is a classification and regression tree (CART; De’ath and Fabricius 2000) based bootstrap method that corrects many of the known issues in CART, such as over-fitting (Breiman 2001; Cutler et al. 2007), and provides very well-supported predictions with large numbers of independent variables (Cutler et al. 2007). We used a modelling approach developed by Evans and Cushman (2009) to predict occurrence probabilities of deforestation for each nation using the random forest method (Breiman 2001; Cutler et al. 2007) as implemented in the package ‘randomForest’ (Liaw and Wiener 2002) in R (R Development Core Team 2008). We conducted a parallel set of analyses with logistic regression for each nation, using the same training data and predictor variable data set applied to the same landscape in the same time period as in the random forest models, enabling robust comparison of the performance of random forest and logistic regression.

We conducted the random forest and logistic regression analyses in two steps. First, we ran univariate models across the multiple scales to identify the scale at which each variable had the strongest ability to predict forest loss in 2010 from the landscape condition in 2000, as suggested by McGarigal et al. (2016) as a robust approach for multi-scale model optimization. To accomplish this, we ran a series of single-variable random forest analyses for each variable across the six scales in each nation and used the model improvement ratio (MIR; Murphy et al. 2010) to measure the relative predictive strength of each scale of the variable. The MIR calculates the permuted variable importance, represented by the mean decrease in out-of-bag error, standardized from zero to one. We compared the MIR scores for all scales for each variable, and retained the scale that had the highest MIR score for further multivariate modelling. In the logistic regression we used the same scales as identified in the random forest model to ensure the models were using identical input data, which is essential for a strict comparison of performance.
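The scale-selection step can be sketched as follows. We assume here that the MIR is obtained by dividing each importance value by the largest importance in the model, which is one common way to standardize to a zero-to-one range; the importance values are hypothetical, and the function names are ours.

```python
def model_improvement_ratio(importance):
    """Scale permuted importance values so that the largest equals 1.

    Assumption: MIR = importance / max(importance).
    """
    top = max(importance.values())
    return {name: value / top for name, value in importance.items()}


def best_scale(importance_by_scale):
    """Return the focal radius (km) with the highest MIR for one variable."""
    mir = model_improvement_ratio(importance_by_scale)
    return max(mir, key=mir.get)


# Hypothetical mean decrease in out-of-bag accuracy for one variable
# (for example, edge density) at the six focal radii.
edge_density_importance = {1: 0.012, 10: 0.030, 20: 0.045,
                           30: 0.060, 40: 0.051, 50: 0.048}
mir = model_improvement_ratio(edge_density_importance)
selected = best_scale(edge_density_importance)
```

Only the selected scale of each variable would then be carried into the multivariate models.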

In the second step we used random forest and logistic regression to develop multivariate models predicting probability of forest loss in each nation in the 2000–2010 time period as a function of landscape condition across the suite of scale-optimized variables in the year 2000. To identify the most parsimonious random forest model for each nation we applied the Model Improvement Ratio (MIR; Murphy et al. 2010). In model selection using MIR, the variables are subset using 0.10 increments of MIR value, with all variables above the threshold retained for each model. This subset is always performed on the original model’s variable importance to avoid over-fitting (Svetnik et al. 2004). We compared each subset model and selected the model that exhibited the lowest total out-of-bag error and lowest maximum within-class error. To identify the most parsimonious logistic regression model for each nation we used the same training and predictor variable set as in the random forest model for that nation, and employed all-subsets logistic regression with model averaging, using the ‘dredge’ function in the R package MuMIn (e.g., Timm et al. 2016; Chambers et al. 2016).
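The MIR-based subsetting can be sketched as below. The MIR values and error rates are hypothetical. Treating a variable exactly at the threshold as retained, and breaking ties on out-of-bag error by maximum within-class error, are our assumptions about details the text does not specify.

```python
def mir_subsets(mir, steps=10):
    """List (threshold, retained variables) for thresholds 0.1, 0.2, ..., 1.0.

    Every subset is drawn from the original model's MIR values, never from
    a refitted model.
    """
    subsets = []
    for i in range(1, steps + 1):
        threshold = i / steps
        kept = sorted(name for name, value in mir.items() if value >= threshold)
        subsets.append((threshold, kept))
    return subsets


def select_model(results):
    """Pick the candidate with the lowest out-of-bag error, then the lowest
    maximum within-class error (lexicographic ordering is an assumption)."""
    return min(results, key=lambda r: (r["oob_error"], r["max_class_error"]))


# Hypothetical MIR values from an original full model.
mir = {"elevation": 1.0, "edge_20km": 0.62, "roughness_40km": 0.35, "slope_pos": 0.08}
subsets = mir_subsets(mir)

# Hypothetical errors from refitting three of the subset models.
results = [
    {"threshold": 0.1, "oob_error": 0.081, "max_class_error": 0.21},
    {"threshold": 0.4, "oob_error": 0.079, "max_class_error": 0.19},
    {"threshold": 0.7, "oob_error": 0.095, "max_class_error": 0.27},
]
chosen = select_model(results)
```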

Model predictions for the random forest model were created by using a ratio of the majority votes-matrix to create a probability distribution. Random forest makes predictions based on the plurality of votes across all bootstrap trees and not on a single rule set. This votes matrix can be scaled and treated as a probability given the error distribution of the model. We used the function that Evans and Cushman (2009) added to GridAsciiPredict (Crookston and Finley 2008), which uses the votes-probability function to write the probabilities to ASCII grid(s). Model predictions for the logistic regression model were created by calculating p = exp(z)/(1 + exp(z)), where p is the probability of forest loss and z is the linear combination of the model-averaged coefficients multiplied by the independent variables.
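Both prediction rules can be written in a few lines. The sketch below is illustrative only: the tree votes, intercept, coefficients and predictor values are hypothetical, and the function names are ours rather than those of any package cited above.

```python
import math


def vote_probability(tree_votes):
    """Share of trees voting for forest loss (1) rather than persistence (0)."""
    return sum(tree_votes) / len(tree_votes)


def logistic_probability(intercept, coefficients, values):
    """p = exp(z) / (1 + exp(z)), with z the linear predictor.

    The two branches are algebraically identical; splitting on the sign of z
    avoids overflow in exp() for large |z|.
    """
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical cell: three of four trees vote for loss.
p_rf = vote_probability([1, 0, 1, 1])

# Hypothetical model-averaged coefficients and predictor values giving z = 0.
p_lr = logistic_probability(-1.0, [0.5, 2.0], [1.0, 0.25])
```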

Effects of training ratio of loss and persistence

We assessed the effects of varying the ratio of loss and persistence points by randomly choosing a relatively large number of persistence points for each nation and systematically varying the number of loss points, refitting the random forest and logistic regression models at each increment. For Kalimantan and Malaysian Borneo we chose 20,000 persistence cells, while in Brunei, which is much smaller, we chose 2311 persistence cells (25% of all persistence cells), and varied the proportion of loss cells at four intervals above and below the true ratio, rerunning the random forest and logistic regression predictor models at each ratio. To assess the effects of the ratio of loss to persistence cells in the training sample we calibrated the number of loss cells until the sum of the predicted probability map matched the observed number of loss cells in the nation over the 2000–2010 period. A simple simulation demonstrated that when the sum of predicted probability equals the number of actual loss cells, the prediction is unbiased in terms of the amount of predicted loss. We fit logarithmic regressions to the relationship between ratio of persistence and loss cells in the training dataset and bias in the predicted probability map for each nation.
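The calibration logic can be sketched as follows. The sum of predicted probabilities is the expected number of loss cells, so its ratio to the observed count measures bias in predicted amount. The bias definition (predicted over observed, minus one), the least-squares fit of bias against the natural log of the training ratio, and all numbers are assumptions for illustration.

```python
import math


def prediction_bias(probabilities, observed_loss_cells):
    """Proportional over-prediction (positive) or under-prediction (negative)."""
    return sum(probabilities) / observed_loss_cells - 1.0


def fit_log(ratios, biases):
    """Least-squares fit of bias = a + b * ln(ratio); returns (a, b)."""
    xs = [math.log(r) for r in ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(biases) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, biases))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def unbiased_ratio(a, b):
    """Training ratio at which the fitted bias crosses zero."""
    return math.exp(-a / b)


# Hypothetical calibration runs in which bias is zero at a ratio of 0.2.
ratios = [0.05, 0.1, 0.2, 0.4, 0.8]
biases = [0.4 * math.log(r / 0.2) for r in ratios]
a, b = fit_log(ratios, biases)
ratio_zero_bias = unbiased_ratio(a, b)
```

In this framing, a larger fitted coefficient b corresponds to greater sensitivity of the predicted amount to the training ratio.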

Naïve model and model performance

Pontius et al. (2007) urge authors to compare any statistical landscape change model to a naïve model applied to the same landscape. Comparison to random allocation is not an appropriate naïve model because the factors driving forest loss are not random. For example, a naïve model could predict forest loss simply near previous forest loss. We constructed such a naïve model by making risk of forest loss inversely proportional to the 10th power of the distance from non-forest in 2000. This creates a nonlinear relationship such that the risk of forest loss in the naïve model is 1 at the edge of non-forest in 2000, but drops nonlinearly as one moves away from the edge into forest interior, reflecting the observed pattern that deforestation risk decreases faster than linearly with distance into existing forest (Fig. 1).
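One possible reading of the naïve model is sketched below. The text does not give the exact formula, so the form risk = d^(-10), with d the Euclidean distance in cell units to the nearest non-forest cell (so that d = 1 and risk = 1 for forest cells adjacent to the edge), is an assumption, as are the function name and the brute-force distance search.

```python
import math


def naive_risk(forest_mask, power=10):
    """Risk surface that decays with distance from non-forest cells.

    forest_mask[r][c] is True for forest in 2000. Non-forest cells get None.
    Assumes the mask contains at least one non-forest cell.
    """
    rows, cols = len(forest_mask), len(forest_mask[0])
    non_forest = [(r, c) for r in range(rows) for c in range(cols)
                  if not forest_mask[r][c]]
    risk = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if forest_mask[r][c]:
                d = min(math.hypot(r - nr, c - nc) for nr, nc in non_forest)
                risk[r][c] = d ** -power
    return risk


# Hypothetical transect: one non-forest cell followed by three forest cells.
risk = naive_risk([[False, True, True, True]])
```

Because a threshold-based assessment depends only on the rank order of the risk values, any strictly decreasing function of distance would rank cells identically; the exponent changes the shape of the surface, not the ordering.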

Fig. 1

Map of Borneo showing areas of forest loss between 2000 and 2010 in yellow, areas of forest persistence from 2000 to 2010 in green, and areas that were not forest in 2000 in black. (Color figure online)

There are a multitude of ways to assess the performance of predictions of forest loss, and most previous studies have used the Kappa statistic (Cohen 1960) and similar measures of improvement of predicted classification compared to random assignment. However, following Pontius and Millones (2011), we eschewed the use of the Kappa statistic given that it does not report a meaningful statistical measure of predictive success, even when corrected to address the two different aspects of prediction related to predicted amount and predicted allocation (Pontius and Si 2014). In addition, the predictions produced by random forest and logistic regression are in the form of predicted probabilities, thus it is more meaningful to assess the continuous pattern of predicted probability in comparison to the actual observed changes than to cross-tabulate observed versus a single Boolean predicted change (Pontius and Si 2014). Cross tabulation requires use of a single threshold to transform the predicted probabilities into a Boolean response, which loses information concerning the various probabilities. We assessed the performance of the random forest and logistic regression predictions using area under the total operating characteristic curve, as suggested by Pontius and Si (2014) and Pontius and Parmentier (2014). We produced visualizations of the predicted probability of forest loss compared to the actual distribution of forest loss and persistence cells in which we overlaid the forest loss points on the predicted probability surface to visually display the association between predicted probabilities and observed changes.
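A minimal sketch of how a TOC curve and its AUC might be computed from predicted probabilities and observed outcomes follows. The curve plots hits against hits plus false alarms as the threshold is lowered; we take the AUC as the area under the curve inside the bounding parallelogram divided by the area of that parallelogram. The function names and the example data are hypothetical.

```python
def toc_curve(probabilities, observed):
    """Points (hits + false alarms, hits), one per distinct threshold."""
    pairs = sorted(zip(probabilities, observed), key=lambda t: -t[0])
    points = [(0, 0)]
    flagged = hits = 0
    i = 0
    while i < len(pairs):
        j = i
        # Cells with tied probabilities enter the curve together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            flagged += 1
            hits += pairs[j][1]
            j += 1
        points.append((flagged, hits))
        i = j
    return points


def toc_auc(probabilities, observed):
    """Area under the TOC curve as a proportion of the parallelogram.

    Assumes at least one loss cell and one persistence cell.
    """
    points = toc_curve(probabilities, observed)
    n = len(observed)
    p = sum(observed)  # hits + misses, the observed amount of loss
    area = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    # Subtract the triangle below the right edge of the parallelogram.
    return (area - p * p / 2.0) / (p * (n - p))


# Hypothetical cells: 1 = observed loss, 0 = observed persistence.
perfect = toc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = toc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A model that ranks every loss cell above every persistence cell scores 1, and a model with no discriminating ability scores 0.5.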

Prediction to the 2010–2020 time period

Once random forest and logistic regression models had been produced for the 2000–2010 time period for each nation (Brunei, Malaysia, Indonesia), we applied these models to predict the risk of forest loss in the 2010–2020 time period by calculating the value of the predictor variables in the year 2010 and applying the models produced in the 2000–2010 time period to them. Having computed the predicted probability of forest loss in each cell over the 2010–2020 time period, we then calculated the proportion of each national territory in Borneo with forest loss risk greater than 25, 50 and 75% between 2010 and 2020.
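The summary step can be illustrated with a short sketch. The probabilities are hypothetical, and the denominator used here is simply the set of cells passed in; whether the proportion is taken over forest cells or over the whole national territory is a choice the caller must make.

```python
def proportion_above(probabilities, thresholds=(0.25, 0.50, 0.75)):
    """Share of cells whose predicted risk exceeds each threshold."""
    n = len(probabilities)
    return {t: sum(p > t for p in probabilities) / n for t in thresholds}


def expected_loss_fraction(probabilities):
    """Mean predicted probability, i.e. the expected proportion of cells lost."""
    return sum(probabilities) / len(probabilities)


# Hypothetical predicted 2010-2020 risk for four forest cells.
risk_2020 = [0.1, 0.3, 0.6, 0.8]
shares = proportion_above(risk_2020)
expected = expected_loss_fraction(risk_2020)
```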

Results

Model calibration

Calibration of the ratio of loss to persistence points in the training sample gave a very clear result in each of the three nations. In both the random forest calibration and the logistic regression calibration, the predicted amount of forest loss was biased upward when there was a higher proportion of loss points in the training sample than in the actual landscape (Fig. 2). Conversely, predicted forest loss was biased downward when there was a lower proportion of deforested samples in the training data than in the actual landscape. An unbiased prediction of the amount of forest loss was obtained when the ratio of deforested to not-deforested points in the training sample matched that in the actual landscape, except in logistic regression when the actual proportion of deforested cells was very low (Brunei), in which case logistic regression over-predicted forest loss at all ratios of deforested to not-deforested points. In addition, there was a very tight logarithmic relationship predicting the amount of bias (Fig. 2). In all three nations, the logistic regression calibration showed higher sensitivity to the ratio of loss and persistence cells, with a larger coefficient in the logarithmic equation leading to larger bias as the ratio departed from the true ratio.

Fig. 2

Calibration of forest loss versus forest persistence points for a Brunei, b Malaysian Borneo and c Kalimantan. The x-axis reports the ratio of forest loss to forest persistence points in the training sample. The vertical line reflects the actual ratio in observed forest loss versus forest persistence points between 2000 and 2010. The y-axis reports the ratio of over/under prediction of forest loss. Positive numbers reflect over prediction of forest loss by that proportion, while negative numbers reflect under prediction. The random forest calibration and trend line is shown with circular markers, with a logarithmic fit. The logistic regression calibration and trend line is shown with triangular markers, with logarithmic fit

Observed and predicted forest loss amount in the 2000–2010 period

In the period between 2000 and 2010 the Miettinen et al. (2011) maps report that 5.1% of the forest present in 2000 in Brunei, 23.76% in Malaysian Borneo and 15.44% in Kalimantan was lost by 2010. The calibrated random forest model predicted risk surfaces for these nations that produced expected values of loss that were within 0.5% of the actual observed amount of forest loss in each case (Table 2).

Table 2 Tabulation of number of forested pixels persisting and lost, and proportion loss between 2000 and 2010 for the three nations in the Borneo area

Model performance

Across the three nations, the calibrated random forest model out-performed the logistic regression and naïve models based on the TOC AUC value, which ranged from 0.931 in Malaysian Borneo, to 0.928 in Kalimantan and 0.917 in Brunei (Table 3). In Malaysian Borneo and Kalimantan the random forest model not including FRAGSTATS landscape variables was the second highest performing model with AUC of 0.898 and 0.890 respectively. In contrast, in Brunei the calibrated logistic regression and the naïve models both outperformed the random forest model not including landscape metrics based on TOC AUC (0.853 and 0.833 vs. 0.805). In the two large study areas, Kalimantan and Malaysian Borneo, the naïve model based on a power function of distance into forest from its edge performed weakest (Table 3).

Table 3 Model performance assessed with the TOC for assessing accuracy of continuous predicted probability surfaces, relative to the naïve model

The display of the TOC curves (Fig. 3) illustrates several things not illustrated by the AUC table alone. TOC shows for each threshold the hits, misses, false alarms, and correct rejections (Pontius and Si 2014). The vertical distance from the horizontal axis to the TOC curve equals Hits, while the vertical distance from the TOC curve to the Hits + Misses line equals Misses. The maximum height of the TOC curve is equal to Hits + Misses, which indicates the size of the forest loss. The horizontal distance from the bounding parallelogram on the left to the TOC curve equals false alarms, and the horizontal distance from the TOC curve to the parallelogram on the right equals correct rejections. The most notable feature is the compression of the TOC curves in Kalimantan, with the naïve model performing quite close to the logistic regression model, and the logistic regression model quite close to the random forest models. This indicates that the pattern of deforestation in Kalimantan is quite similar to the naïve model as a function of distance from the edge of past deforestation. In contrast, the higher spread of the curves in Malaysian Borneo indicates that forest loss in this study area is driven by factors not reflected well in the naïve model. In addition, in Malaysian Borneo there is a relatively large gap between the logistic regression and the two random forest models as well, indicating relatively higher performance of random forest as compared to logistic regression.
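The four quantities that the TOC reports at a single threshold can be tallied as in the sketch below. The function name, the use of "greater than or equal to" for the threshold, and the example values are assumptions for illustration.

```python
def toc_counts(probabilities, observed, threshold):
    """Tally hits, misses, false alarms and correct rejections.

    A cell is flagged as predicted loss when its probability is at or above
    the threshold; observed is 1 for loss and 0 for persistence.
    """
    counts = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_rejections": 0}
    for p, o in zip(probabilities, observed):
        flagged = p >= threshold
        if flagged and o:
            counts["hits"] += 1
        elif flagged:
            counts["false_alarms"] += 1
        elif o:
            counts["misses"] += 1
        else:
            counts["correct_rejections"] += 1
    return counts


# Hypothetical cells.
probabilities = [0.9, 0.6, 0.4, 0.2]
observed = [1, 0, 1, 0]
at_half = toc_counts(probabilities, observed, threshold=0.5)
```

At every threshold, hits plus misses equals the observed amount of loss, which is why the top of the TOC plot sits at a fixed height.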

Fig. 3

TOC curves showing comparative model performance among the logistic regression, random forest, random forest without fragmentation variables, and the naïve model. Higher model performance is indicated by stronger convex curvature toward the upper left corner of the plot space. a Brunei, b Malaysian Borneo, c Kalimantan

Model interpretation

Our analysis produced nine different predictive models, consisting of random forest with and without landscape metrics, and logistic regression models for each of the three nations comprising Borneo. In the interest of space we briefly describe the highest performing model (random forest with landscape metrics) for each nation here, and present full information on all nine models and their interpretation in Appendix 1.

Brunei model

The ten most important variables in the calibrated random forest Brunei model based on model improvement ratio were aggregation index at 10 km radius, edge density at 10 km radius, proportion of peat swamp forest at 40 km radius, proportion of water at 50 km radius, proportion of plantation or regrowth at 1 km radius, proportion of lowland forest at 40 km radius, Shannon diversity at 10 km radius, topographical roughness at 40 km radius, elevation and proportion of lowland mosaic at 40 km radius (Figure A1). Of the ten most influential variables in the prediction of deforested vs not-deforested cells in Brunei between 2000 and 2010, as judged by model improvement ratio, five had positive monotonic relationships with increasing probability of deforestation with increasing value of the variable (edge density at 10 km radius, proportion of peat swamp forest at 30 km radius, proportion of water at 50 km radius, proportion of plantation or regrowth at 1 km radius, Shannon diversity index at 10 km radius; Figure A2). Two variables, including the most influential variable (aggregation index at 10 km radius), had negative monotonic relationships with deforestation risk, such that deforestation risk decreases as the values of these variables increase. Two variables had unimodal relationships such that the frequency of deforestation was maximum at intermediate values (proportion of lower montane forest at 40 km radius and distance to population centre). We produced a visualization of the pattern of predicted forest loss probability across Brunei (Figure A3), with zoomed-in views of two areas showing the pattern of observed loss in relation to the predicted probability of loss (Figure A4).

Malaysian Borneo model

The ten most important variables, based on model improvement ratio, in the calibrated Malaysian Borneo model were proportion of lowland mosaic at 20 km radius, proportion of plantation or regrowth at 30 km radius, edge density at 20 km radius, elevation, proportion of montane mosaic at 50 km radius, patch density at 30 km radius, proportion of lower montane forest at 50 km radius, Shannon diversity index at 40 km radius, focal mean population density at 100 km radius and topographical roughness at 40 km radius (Figure A6). Of the ten most influential variables in the calibrated random forest model in Malaysian Borneo, six had positive monotonic relationships wherein frequency of forest loss increased as the value of the variable increased (proportion of lowland mosaic at 20 km radius, proportion of plantation or regrowth at 30 km radius, edge density at 20 km radius, patch density at 30 km radius, Shannon diversity at 40 km radius, and focal mean population density at 100 km radius). All of these showed strongly non-linear relationships with rapid initial rise followed by asymptotic flattening (Figure A7). Four of the ten most influential variables in the calibrated Malaysian Borneo random forest model showed monotonic negative relationships where frequency of deforestation between 2000 and 2010 declined as the value of the variable increased. These mostly showed what appear to be negative exponential shapes, with rapid decline at low values of x followed by flattening as the x variable increased (elevation, proportion of montane mosaic at 50 km radius, proportion of upper montane forest at 50 km radius, and topographical roughness at a 40 km radius; Figure A7). We produced a visualization of the pattern of predicted forest loss probability across Malaysian Borneo (Figure A8), with zoomed-in views of two areas showing the pattern of observed loss in relation to the predicted probability of loss (Figure A9).

Kalimantan model

The ten most important variables, based on model improvement ratio, for Kalimantan were elevation, patch density at 40 km radius, proportion of lowland mosaic at 50 km radius, proportion of lower montane forest at 20 km radius, proportion of plantation or regrowth at 50 km radius, edge density at 10 km radius, focal mean population density at 100 km radius, topographical roughness at 50 km radius, proportion of lowland open at 50 km radius and Shannon’s diversity index at 40 km radius (Figure A11). Of the ten most important variables in the calibrated random forest model predicting deforested versus not-deforested cells between 2000 and 2010, seven had positive monotonic relationships (patch density at 40 km radius, proportion of lowland mosaic at 50 km radius, proportion of plantation or regrowth at 50 km radius, edge density at 10 km radius, focal mean population density at 100 km radius, proportion of lowland open at 50 km radius, and Shannon diversity at 40 km radius; Figure A12). As in the Malaysian Borneo model, many of these were strongly nonlinear relationships. The remaining three top variables all had negative monotonic relationships (elevation, proportion of lower montane forest at 20 km radius, and topographical roughness at 50 km radius). As in the Malaysian Borneo model, these were strongly nonlinear, with most showing a negative exponential shape (Figure A12). We produced a visualization of the pattern of predicted forest loss probability across Kalimantan (Figure A13), with a zoomed-in view of two areas showing the pattern of observed loss in relation to the predicted probability of loss (Figure A14).

Predicted forest loss risk in 2010–2020

We produced a map of predicted forest loss risk in the 2010–2020 time period (Fig. 4) and calculated the expected amount of forest loss, both in total and as a proportion of the forest in each nation, by averaging the predicted forest loss risk across all pixels that were forest in 2010 (Table 4). The rate of forest loss predicted in the 2010–2020 period was substantially higher in Malaysian Borneo than in Kalimantan or Brunei (23.2 vs. 15.9%, Table 4). The predicted rate of forest loss in 2010–2020 was quite similar to that observed in 2000–2010 for Malaysian Borneo (23.2 vs. 23.3%) and Kalimantan (15.9 vs. 15.4%), while the predicted amount of forest loss in Brunei increased to 15.9% from the observed amount of 5.9% in 2000–2010.
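The expected-value calculation described above can be sketched as follows. This is a minimal illustration that assumes the predicted probabilities and the 2010 forest mask are available as flat lists of equal length; the function name, variable names and numbers are hypothetical and are not values from Table 4.

```python
def expected_forest_loss(probabilities, forest_mask):
    """Return (expected pixels lost, expected proportion of forest lost).

    probabilities: predicted probability of loss for each pixel.
    forest_mask:   True where the pixel was forest at the start of the period.
    """
    forest_probs = [p for p, is_forest in zip(probabilities, forest_mask) if is_forest]
    if not forest_probs:
        raise ValueError("no forest pixels in the mask")
    # The expected number of lost pixels is the sum of per-pixel probabilities;
    # dividing by the number of forest pixels gives the mean predicted risk.
    expected_pixels = sum(forest_probs)
    return expected_pixels, expected_pixels / len(forest_probs)


# Hypothetical four-pixel landscape; the third pixel is non-forest and is ignored.
probs = [0.10, 0.40, 0.90, 0.20]
mask = [True, True, False, True]
pixels, proportion = expected_forest_loss(probs, mask)
```

In this toy case the three forest pixels have a mean predicted risk of about 0.23, so roughly 23% of the forest (0.7 of 3 pixels) is expected to be lost.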

Fig. 4

Predicted probability of deforestation in the period 2010–2020, obtained by applying the calibrated random forest models developed in the 2000–2010 period to the landscape data in 2010 for each of the three study nations (Brunei, Malaysian Borneo, Kalimantan). Black represents areas of non-forest in 2010. (Color figure online)

Table 4 Expected number of pixels predicted to be deforested in each of the three study regions in the period between 2010 and 2020, in total and as a proportion of the amount of forest existing in 2010

We calculated the proportion of each study region in four bins of degree of risk (<25%, 25–50%, 50–75% and >75% predicted probability of forest loss in the 2010–2020 period; Fig. 5). This analysis showed a difference in the distribution of risk among study regions, which is visually apparent from inspection of Fig. 4. Specifically, Malaysian Borneo has a “flatter” distribution of proportion of pixels across the four risk bins, indicating that a larger proportion of the land area of Malaysian Borneo has moderate risk (33% of forest has 25–50% predicted probability of forest loss in 2010–2020), and a lower proportion has low risk (60% has less than 25% probability of forest loss in 2010–2020), than the other two regions. In Fig. 4 this is seen as a broader area of yellow, indicating moderate risk, and a smaller area of blue (low risk). In contrast, Kalimantan shows a pattern with a high proportion of the land area at low risk (75.5% has less than 25% probability of forest loss in 2010–2020) and a comparatively low proportion of area at intermediate risk (17.9% has between 25 and 50% probability of forest loss between 2010 and 2020). Malaysian Borneo and Kalimantan have similar proportions of high-risk forest, with 6.7 and 6.3% of forest area, respectively, having between 50 and 75% probability of forest loss in 2010–2020.
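A minimal sketch of the four-bin summary described above is given below. It assumes the predicted probabilities for forest pixels are held in a flat list, and it assigns values that fall exactly on a bin boundary to the higher bin; the text does not specify that boundary handling, so it is an assumption, as are the function name and the example values.

```python
def risk_bin_proportions(forest_probs):
    """Return the proportion of forest pixels falling in each of four risk bins."""
    labels = ["<25%", "25-50%", "50-75%", ">75%"]
    counts = dict.fromkeys(labels, 0)
    for p in forest_probs:
        if p < 0.25:
            counts["<25%"] += 1
        elif p < 0.50:
            counts["25-50%"] += 1
        elif p < 0.75:
            counts["50-75%"] += 1
        else:
            counts[">75%"] += 1
    total = len(forest_probs)
    if total == 0:
        raise ValueError("no forest pixels supplied")
    return {label: counts[label] / total for label in labels}


# Hypothetical predicted probabilities for eight forest pixels.
example_probs = [0.10, 0.20, 0.30, 0.60, 0.80, 0.05, 0.45, 0.10]
bins = risk_bin_proportions(example_probs)
```

Applying the same function to each study region separately would give the per-region distributions that are compared in the text.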

Fig. 5

Proportion of forest area existing in 2010 in each of the three study regions predicted to have risk of deforestation in the 2010–2020 time period of (1) less than 25%, (2) between 25 and 50%, (3) between 50 and 75%, and (4) greater than 75%

Discussion

The goal of this study was to identify the multi-scale drivers of forest loss across Borneo, and then to apply this model to predict and map the risk of forest loss between 2010 and 2020. This appears to be the first study to: (1) quantitatively compare the performance of random forest and logistic regression in predicting forest loss risk, (2) conduct formal multi-scale model optimization to predict forest loss risk, (3) formally evaluate the effect of the ratio of forest loss to persistence cells on the predicted probability of forest loss, and (4) evaluate the utility of including landscape metrics quantifying the composition and configuration of a landscape mosaic on predicted risk of future forest loss.

We identified several hypotheses to help us explore these relationships. Contrary to our first hypothesis and the findings of Chawala et al. (2003) and Chen et al. (2004), we found that the probability maps produced by random forest do not underpredict loss when the ratio of loss in the training sample is equal to that in the real landscape. On the contrary, our calibration results showed that both logistic regression and random forest produce unbiased predicted probabilities of forest loss only when the ratio of loss to persisting forest in the training sample matches that in the actual landscape. The bias increases substantially when the ratio departs from that of the actual landscape, with logistic regression substantially more sensitive than random forest to this bias. There are a number of implications of this observation that extend beyond the present study. Following Chawala et al. (2003) and Chen et al. (2004), most users of random forest have purposefully skewed the ratio of training points away from that of the actual population, intending to avoid bias (e.g., Evans and Cushman 2009). Our results clearly show that this actually introduces bias, and we advocate that future users of random forest carefully select training data that are representative of, and proportional to, the composition of the actual landscape.
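One simple way to draw a training sample whose loss-to-persistence ratio matches the landscape is sketched below. It is an illustration under stated assumptions rather than the sampling procedure used in this study: cells are represented by hypothetical integer identifiers, and the function name, sample size and seed are arbitrary.

```python
import random


def sample_matching_prevalence(loss_cells, persist_cells, n_total, seed=0):
    """Draw a training sample that preserves the landscape's loss prevalence.

    loss_cells / persist_cells: identifiers of all cells in each class.
    n_total: desired size of the combined training sample.
    """
    rng = random.Random(seed)
    prevalence = len(loss_cells) / (len(loss_cells) + len(persist_cells))
    # Allocate the sample to each class in proportion to its share of the landscape.
    n_loss = round(n_total * prevalence)
    n_persist = n_total - n_loss
    return rng.sample(loss_cells, n_loss), rng.sample(persist_cells, n_persist)


# Hypothetical landscape: 200 loss cells and 800 persisting cells (20% loss).
all_loss = list(range(200))
all_persist = list(range(200, 1000))
train_loss, train_persist = sample_matching_prevalence(all_loss, all_persist, n_total=100)
```

With these inputs the sample contains 20 loss cells and 80 persisting cells, the same 20% prevalence as the hypothetical landscape. A deliberately balanced sample would instead draw 50 of each, which is the kind of departure from the landscape ratio that the calibration results above associate with bias.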

Second, following other researchers who found that random forest outperforms other methods for prediction and classification (e.g., Cushman et al. 2010; Evans et al. 2011; Rodriguez-Galiano et al. 2012; Schneider 2012), we expected that random forest would outperform logistic regression and the naïve model based on distance to forest edge. Consistent with this hypothesis, random forest was the highest performing method in all three nations comprising Borneo. This confirms the utility of random forest as a modelling tool for predicting forest loss, and we suggest that future studies use this powerful technique, with the correct ratio of loss to persisting locations in the training sample.

Consistent with our third hypothesis we found that models that included landscape variables had substantially higher performance than models that excluded them. In all three nations the random forest model that included landscape variables outperformed the random forest model that excluded them. One of the central tenets of landscape ecology is that patterns influence processes across a range of scales (Turner 1989; Wiens 1989). However, to our knowledge, this paper is the first to use multiple-scale optimization (sensu McGarigal et al. 2016) to model risk of forest loss, and the first to use landscape metrics (e.g., FRAGSTATS, McGarigal et al. 2012) as predictors of that risk. Our analysis shows that landscape pattern variables were important predictors, with the majority of the ten most influential variables in each nation measuring aspects of landscape composition or configuration.

In our fourth hypothesis we predicted that risk of forest loss would be related to predictor variables similarly across the three nations. Contrary to this expectation we noted substantial differences between the nations. Most notably, the pattern of forest loss risk in Kalimantan appears to be driven primarily by elevation and distance to the edge of previous forest loss, while that in Malaysian Borneo is more diffuse, less associated with proximity to previous forest loss, and more driven by landscape heterogeneity at broad scales. This likely reflects differences between the socio-economic drivers of forest loss in these two nations. We further explored this observation qualitatively by inspecting high resolution Google Earth satellite imagery for several landscapes in Malaysian Borneo and Kalimantan. We observed a general pattern of difference between Malaysian Borneo and Kalimantan, wherein forest landscapes in Malaysian Borneo, particularly those in Sarawak at lower elevations, typically have large networks of logging roads that permeate currently unlogged forest, providing diffuse access for future harvest that matches the pattern of our prediction, while in Kalimantan we observed few such diffuse road networks and in most Kalimantan landscapes roads do not extend deeply into currently unlogged forest, leading to the “nibbling at the edge” pattern we observed and the similarity to the distance to edge naïve model.

Our fifth hypothesis proposed that forest loss risk would be related to (a) topography, with high risk at low elevations and flat areas, and low risk at higher elevations and steep terrain, (b) past forest loss in the surrounding region, (c) protected area status, and (d) human population parameters such as distance to large settlements and local population density, in that order of influence. Our results were mixed regarding these predictions. Only in Kalimantan was a topographical variable the most important based on model improvement ratio, with elevation having the largest effect, and topographical roughness having the 7th highest effect. In Malaysian Borneo elevation was the fourth most influential variable and topographical roughness was the 10th, while in Brunei topographical roughness and elevation were the seventh and eighth most influential variables. In Malaysian Borneo the two most influential variables were the extent of the local landscape covered by areas that have had previous forest loss (lowland mosaic and plantation/regrowth), while in Kalimantan these were the third and fifth most important predictors, which supports our expectation that the extent of previous forest loss in a local landscape would be a strong predictor of future loss. In all three nations, however, we found a strong effect of landscape configuration metrics such as edge density (#3 Malaysia, #2 Brunei, #8 Kalimantan), patch density (#6 Malaysia, #2 Brunei, #2 Kalimantan), aggregation index (#1 Brunei) and Shannon landscape diversity (#8 Malaysia, #7 Brunei, #10 Kalimantan). In all three nations these landscape configuration metrics, which measure the total heterogeneity and fragmentation of the landscape, were more important than protected area status, human population density or distance to population centers. This suggests that heterogeneous landscape configurations created by past forest loss are highly predictive of future forest loss.
Importantly, in none of the three nations was protected area status retained by the variable selection, suggesting that across Borneo protected area effectiveness in minimizing forest loss may be low and/or that most protected areas are in locations of very low predicted risk based on other factors (such as high elevation areas that have high topographical roughness and are far from past forest loss).

Predicted rates and patterns of risk of forest loss in 2020

We predicted that the rates and patterns of forest loss in the 2010–2020 period are likely to be similar to those seen in the 2000–2010 period. Forest loss in Indonesian Borneo is predicted to continue to move along fronts of contagious expansion from previously logged areas, spreading into the forest and leaving very few patches behind. Conversely, we predict that forest loss in Malaysian Borneo will continue to expand more diffusely throughout the landscape, leading to much more extensive and heterogeneous patterns of forest loss. These differences reflect differences between these nations in the extensiveness of forest road networks. We predicted total amounts of forest loss in the 2010–2020 period very similar to those observed in the 2000–2010 period in both Malaysian Borneo and Kalimantan, with Malaysian Borneo continuing to have one of the highest forest loss rates reported in the world, and Kalimantan having a lower but still very high rate of forest loss by global standards.

Conclusion

Our results have several important implications. We confirm reports that random forest outperforms logistic regression for prediction of forest loss in our study areas. In doing so, and in contrast to common practice, we demonstrate that unbiased results were achieved only when the proportion of the loss class in the training sample matched the proportion in the actual landscape. We also confirm that multiple-scale modelling using landscape metrics as predictors in a random forest framework is a powerful approach to landscape change modelling. There is immense, imminent risk to Borneo’s forests, with clear spatial patterns of risk related to topography and landscape structure that differ between the three nations that comprise Borneo. There are substantial differences in the spatial drivers and patterns of deforestation between Kalimantan and Malaysian Borneo: in the former, forest loss is highly associated with the edge of previous loss and takes the form of a spatially contagious “nibbling” at the remaining forest patches, while in the latter, extensive road networks built along ridgetops lead to widespread, diffuse and highly fragmenting patterns of forest loss.