Support vector machines to map rare and endangered native plants in Pacific islands forests
Highlights
► SVM paradigm is adapted to rare species distribution modeling.
► SVM can be more accurate than random forests.
► Target species rarity is a consequence of past and present human impacts.
Introduction
Detailed knowledge of the ecological range and geographic distribution of rare species is critical for biodiversity conservation and management (Ferrier, 2002; Rushton et al., 2004). Oceanic islands are famous for their unique biota and high endemism, but also for their great vulnerability to anthropogenic disturbances (Caujapé-Castells et al., 2010; Loope et al., 1988), which cause declines in species abundance and distribution and sometimes lead to extinction (Whittaker and Fernandez-Palacios, 2007). As a result, a large number of endangered species are currently found in island ecosystems (IUCN, 2011). Besides their conservation value, rare species may also play a key role in ecosystem functioning (Lyons and Schwartz, 2001; Lyons et al., 2005).
Occurrence records are scarce for rare species, so only small training samples are available for species distribution models (Pearson et al., 2007; Stockwell and Peterson, 2002; Wisz et al., 2008). A recent study by Williams et al. (2009) compared the ability of a range of models to predict the distribution of six rare plant species (from 9 to 129 occurrences). These models included generalized linear models, artificial neural networks, the commonly used maximum entropy (Maxent) model and random forests (RF), an ensemble of classification and regression trees (CART) (Breiman, 2001); RF outperformed the other models. RF is an ensemble classifier developed to produce accurate predictions while limiting overfitting of the data. It consists of many decision trees and outputs the class that occurs most frequently among the individual trees: each input vector is passed to every tree of the forest, each tree gives a classification (the tree “votes” for that class), and the forest chooses the class with the most votes over all the trees. RF has recently been used successfully for species distribution modeling (Benito Garzon et al., 2008; Cutler et al., 2007; Prasad et al., 2006; Williams et al., 2009). RF is an easy-to-use classifier since the user has to set only two parameters: the number of trees and the number of variables to be randomly selected from the available set of variables.
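The voting scheme and the two user-set parameters can be illustrated with a deliberately small sketch. It is not Breiman's full algorithm: each "tree" here is a single-split stump, so the random variable subset is drawn once per tree rather than at every split, and the toy data (elevation and slope values with presence/absence labels) are invented for illustration. The names `fit_stump`, `fit_forest` and `predict_forest` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Fit a one-split tree: pick the (feature, threshold, orientation)
    with the fewest training errors among the allowed features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            for above_label in (0, 1):
                preds = [above_label if row[f] > threshold else 1 - above_label
                         for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above_label)
    _, f, threshold, above_label = best
    return lambda row: above_label if row[f] > threshold else 1 - above_label


def fit_forest(X, y, n_trees, n_vars, seed=0):
    """The two user-set parameters: number of trees (n_trees) and number
    of randomly selected variables offered to each tree (n_vars)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]        # bootstrap sample
        feats = rng.sample(range(n_features), n_vars)   # random variable subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Every tree votes; the forest returns the most frequent class."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Invented toy data: [elevation (m), slope (degrees)], 1 = presence, 0 = absence.
X = [[100, 5], [200, 8], [300, 10], [700, 20], [800, 25], [900, 30]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y, n_trees=25, n_vars=1)
print(predict_forest(forest, [950, 35]), predict_forest(forest, [50, 2]))
```

An odd number of trees avoids tied votes in the two-class case.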
Nonetheless, a machine learning algorithm from the field of remotely sensed data classification, called support vector machines (SVM) (Vapnik, 1998), may be an important technique for modeling rare species distributions. The algorithms used in remote sensing to classify object reflectance are substantially the same as those used in species distribution models to classify environmental layers (Franklin, 1995). Accordingly, SVM has been used successfully to model the distribution of common species in a few recent studies (Drake et al., 2006; Guo et al., 2005; Pouteau et al., 2011a).
SVM was originally introduced as a binary classifier (Vapnik, 1998) and is extensively described by Burges (1998), Hsu et al. (2009) and Schölkopf and Smola (2002). In its classical implementation, it uses training samples from two classes (e.g. presence/absence) within a multidimensional feature space to fit an optimal separating hyperplane (in each dimension, the vector component is the image gray level). SVM seeks to maximize the margin, that is, the distance between the hyperplane and the closest training samples, called support vectors.
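As a minimal sketch of what "margin" means here, assume a linear hyperplane written as w·x + b = 0 (the names `w`, `b`, `distance_to_hyperplane` and `margin` are illustrative, and the four sample points are invented). The margin is the smallest distance from any training sample to the hyperplane, and the optimal hyperplane is the one for which this smallest distance is largest.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def margin(w, b, samples):
    """Distance between the hyperplane and its closest training sample."""
    return min(distance_to_hyperplane(w, b, x) for x in samples)


# Invented 2-D samples: two absences on the left, two presences on the right.
samples = [[2.0, 1.0], [4.0, 3.0], [7.0, 0.0], [9.0, 2.0]]

# Two candidate separating hyperplanes (vertical lines at x0 = 5 and x0 = 5.5).
print(margin([1.0, 0.0], -5.0, samples))   # limited by the sample at x0 = 4
print(margin([1.0, 0.0], -5.5, samples))   # midway between x0 = 4 and x0 = 7
```

The second hyperplane sits midway between the closest samples of the two classes, so its margin is larger; those closest samples play the role of support vectors.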
SVM projects vectors into a high-dimensional feature space by means of the kernel trick, then fits the optimal hyperplane that separates the classes by solving an optimization problem. For a generic pattern x, the corresponding estimated label ŷ is given by Eq. (1):
$$\hat{y} = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b \right) \qquad (1)$$
wherein N is the number of training points, yi is the label of the ith sample, b is a bias parameter, K(xi, x) is the chosen kernel and the αi are the Lagrange multipliers.
Several kernels are used in the literature. According to Hsu et al. (2009), and as supported by many other authors, the Gaussian radial basis function (RBF) kernel has two advantages: (i) it is very successful since it works in an infinite-dimensional feature space; and (ii) it has a single parameter γ > 0, unlike other well-performing kernels (e.g. polynomial). It is given by Eq. (2):
$$K(x_i, x) = \exp\left( -\gamma \, \lVert x_i - x \rVert^{2} \right) \qquad (2)$$
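To make the role of γ concrete, the following sketch evaluates the Gaussian RBF kernel in its standard form, exp(−γ‖xi − x‖²), for a few values of γ. The function name `rbf_kernel` and the sample points are illustrative only.

```python
import math


def rbf_kernel(xi, x, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * sq_dist)


# Similarity between two invented points for increasing gamma:
# the larger gamma is, the faster similarity decays with distance.
for gamma in (0.1, 0.5, 2.0):
    print(gamma, rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma))
```

A small γ makes distant samples still count as similar (smoother decision boundary), whereas a large γ restricts the influence of each sample to its immediate neighbourhood.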
Noise in the data can be accounted for by defining a distance tolerating the data scattering, thus relaxing the decision constraint. This regularization parameter is called C.
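One common way to write this trade-off, given here as an assumption since the text does not spell out the objective, is the standard soft-margin formulation: minimize ½‖w‖² + C Σ ξi, where the slack ξi measures how far sample i falls on the wrong side of its margin. The sketch below evaluates that objective for a linear hyperplane (chosen for readability) with labels in {−1, +1}; the name `soft_margin_objective` and the data are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Soft-margin SVM objective for a linear hyperplane w.x + b = 0:
    0.5 * ||w||^2 + C * (sum of slacks), labels y in {-1, +1}."""
    regularization = 0.5 * sum(wi * wi for wi in w)
    slack = 0.0
    for x, yi in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slack += max(0.0, 1.0 - yi * score)   # zero when the margin is respected
    return regularization + C * slack


# Invented 1-D data: the third sample is a noisy absence on the presence side.
X = [[2.0], [-2.0], [0.5]]
y = [1, -1, -1]
for C in (1.0, 10.0):
    print(C, soft_margin_objective([1.0], 0.0, X, y, C))
```

With a small C the noisy sample is tolerated at little cost; with a large C the same violation dominates the objective, pushing the optimizer toward a boundary that fits the training data more tightly.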
Only the αi associated with support vectors si are non-zero, so the classification function reduces to Eq. (3):
$$\hat{y} = \operatorname{sign}\left( \sum_{i=1}^{P_s} \alpha_i \, y_i \, K(s_i, x) + b \right) \qquad (3)$$
wherein Ps is the number of support vectors. Thus, the decision boundary is based solely on a few meaningful pixels. This is why SVM may be well suited to predicting the distribution of species with scarce occurrence records. Nevertheless, to our knowledge, it has never been used for rare species distribution modeling.
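The reduction to support vectors can be checked numerically with a small sketch. The multipliers below are hand-picked for illustration, not obtained by training, and the names `decision_value` and `predict` are hypothetical; the point is only that training samples with a zero multiplier contribute nothing to the kernel sum.

```python
import math


def rbf_kernel(xi, x, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, x)))


def decision_value(x, points, labels, alphas, b, gamma):
    """Kernel expansion: sum_i alpha_i * y_i * K(point_i, x) + b."""
    return sum(a * yi * rbf_kernel(p, x, gamma)
               for p, yi, a in zip(points, labels, alphas)) + b


def predict(x, points, labels, alphas, b, gamma):
    """Estimated label: the sign of the decision value (+1 or -1)."""
    return 1 if decision_value(x, points, labels, alphas, b, gamma) >= 0 else -1


# Invented training set; alphas are illustrative, only the two samples
# nearest the class boundary have a non-zero multiplier.
points = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
labels = [-1, -1, 1, 1]
alphas = [0.0, 1.0, 1.0, 0.0]
b, gamma = 0.0, 0.5

# Keep only the support vectors (non-zero alpha) and compare.
sv = [(p, yi, a) for p, yi, a in zip(points, labels, alphas) if a != 0.0]
sv_points, sv_labels, sv_alphas = (list(t) for t in zip(*sv))

x_new = [4.5, 0.0]
full = decision_value(x_new, points, labels, alphas, b, gamma)
reduced = decision_value(x_new, sv_points, sv_labels, sv_alphas, b, gamma)
print(full, reduced, predict(x_new, sv_points, sv_labels, sv_alphas, b, gamma))
```

Both sums give the same decision value, so the remaining training samples could be discarded after training without changing any prediction.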
The aim of this study is twofold: (i) to determine which of RF and SVM is better suited to mapping rare species, in a case study focusing on endangered native and endemic plants on Pacific islands; and (ii) to understand the causes of their rarity and endangerment by comparing their predicted potential habitat with their current observed range.
Section snippets
Target rare and endangered species
The present study was conducted on the oceanic tropical island of Moorea (Society archipelago, French Polynesia), located at 17°33′ South and 149°50′ West in the South Pacific Ocean. It is a small (ca. 140 km²) and young (1.5–2.5 million years old) volcanic island with rugged topography; its highest summit reaches 1207 m in elevation.
This work was part of the “Moorea Biocode Project”, an international research program seeking to collect DNA sequence, distribution, morphological and ecological
Vegetation map
The SVM classification of the Quickbird imagery (Fig. 5) gives fairly good results, with a kappa of 0.842 and an AUC of 0.965. Texture is arguably the most informative input, since the classification based on textural information alone (without spectral bands) gives a kappa of 0.821 and an AUC of 0.955 (data not shown).
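For readers unfamiliar with the two accuracy measures, the sketch below computes Cohen's kappa from hard predictions and the AUC from continuous scores using the rank-based (Mann-Whitney) formulation. This is a generic illustration on invented data, not the evaluation procedure used to obtain the values reported above; the function names are hypothetical.

```python
def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    return (observed - expected) / (1 - expected)


def auc(y_true, scores):
    """Probability that a random presence (1) receives a higher score
    than a random absence (0); ties count for one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented validation data: true labels, hard predictions, continuous scores.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
print(cohen_kappa(y_true, y_pred), auc(y_true, scores))
```

Kappa depends on the threshold used to turn scores into classes, whereas the AUC is threshold-independent, which is one reason the two measures are often reported together.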
Contribution of biophysical descriptors
Calculation of the relative contribution of each descriptor, presented in Fig. 6, was based on the difference in AUC (Δ AUC) and the difference in kappa (Δ kappa) yielded with
Random forests vs. support vector machines
RF and SVM were compared on their ability to predict the distributions of rare and endangered species. Williams et al. (2009) found RF to be the best of a wide panel of algorithms for predicting rare species occurrences. To our knowledge, SVM has never been used for predicting rare species distributions. However, it generally outperforms RF in our case study, especially when the number of occurrences is small. The main reason is most likely the result of the paradigm of SVM based on a small
Conclusion
We compared two ecological niche models, random forests (RF) and support vector machines (SVM), in order to predict the distribution of rare species in island forest ecosystems. Our analysis focused on three endangered native and endemic plants with few occurrence records on the tropical oceanic island of Moorea (French Polynesia). It was based on six fine-scale environmental descriptors, namely elevation, slope steepness, slope aspect, windwardness, a compound topographic index (CTI)
Acknowledgments
The authors are grateful to Jean-François Butaud for sharing his GPS points of the target plants, Marie Fourdrigniez for her help during field surveys, the Service de l'Urbanisme of the Government of French Polynesia for providing the DEM, the Délégation à la Recherche of the Government of French Polynesia and the “Moorea Biocode Project” for financial support. We deeply thank Thomas W. Gillespie (Department of Geography, University of California, Los Angeles) for revising the English on an
References (81)
et al. (2010). Conservation of oceanic island floras: present and future global challenges. Perspectives in Plant Ecology, Evolution and Systematics.
et al. (2006). Efficient optimization of support vector machine learning parameters for unbalanced datasets. Journal of Computational and Applied Mathematics.
et al. (2006). The use of small training sets containing mixed pixels for accurate hard image classification: training on mixed spectral responses for classification by a SVM. Remote Sensing of Environment.
et al. (1996). Automated derivation of geographic window sizes for remote sensing digital image texture analysis. Computers and Geosciences.
et al. (1992). A comparison of spatial feature extraction algorithms for land-use classification with SPOT HRV data. Remote Sensing of Environment.
et al. (2005). Support vector machines for predicting distribution of Sudden Oak Death in California. Ecological Modelling.
et al. (2006). Indirect remote sensing of a cryptic forest understorey invasive species. Forest Ecology and Management.
et al. (2011). A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing.
et al. (2000). Comparison of algorithms for classifying Swedish landcover using Landsat TM and ERS-1 SAR data. Remote Sensing of Environment.
et al. (2011). A SVM-based model for predicting distribution of the invasive tree Miconia calvescens in tropical rainforests. Ecological Modelling.
Effect of sample size on accuracy of species distribution models
Ecological Modelling
Remote sensing for biodiversity science and conservation
Trends in Ecology & Evolution
Performance evaluation of texture measures for ground cover identification in satellite images by means of a neural network classifier
IEEE Transactions on Geoscience and Remote Sensing
An analysis of different resampling methods in Coimbatore, District
Global Journal of Computer Science and Technology
Leaf construction cost, nutrient concentration, and net CO2 assimilation of native and invasive species in Hawaii
Oecologia
Classification of multisource and hyperspectral data based on decision fusion
IEEE Transactions on Geoscience and Remote Sensing
Neural network approaches versus statistical methods in classification of multisource remote sensing data
IEEE Transactions on Geoscience and Remote Sensing
Effects of climate change on the distribution of Iberian tree species
Applied Vegetation Science
Random forests
Machine Learning
A tutorial on support vector machines for pattern recognition
Data Mining and Knowledge Discovery
Island Biology
Microclimate in forest ecosystem and landscape ecology
Bioscience
Examining the effect of spatial resolution and texture window size on classification accuracy: an urban environment case
International Journal of Remote Sensing
Synergistic use of multi-temporal ALOS/PALSAR with SPOT multispectral satellite imagery for land cover mapping in the Ho Chi Minh city area, Vietnam
Assessing the Accuracy of Remotely Sensed Data: Principles and Practices
Random forests for classification in ecology
Ecology
Comparing the areas under two or more correlated receiver operating characteristic curve: a nonparametric approach
Biometrics
Modelling ecological niches with support vector machines
Journal of Applied Ecology
Parallel tuning of support vector machine learning parameters for large and unbalanced data sets
A combined support vector machines classification based on decision fusion
Mapping spatial pattern in biodiversity for regional conservation planning: where to from here?
Systematic Biology
Base de données botaniques Nadeaud de l'Herbier de la Polynésie française
Predictive vegetation mapping: geographic modeling of biospatial patterns in relation to environmental gradients
Progress in Physical Geography
Spectral texture for improved class discrimination in complex terrain
International Journal of Remote Sensing
Incorporating texture into classification of forest species composition from airborne multispectral images
International Journal of Remote Sensing
Modeling soil-landscape and ecosystem properties using terrain attributes
Soil Science Society of America Journal
Partial flora of the Society Islands: Ericaceae to Apocynaceae
Textural features for image classification
IEEE Transactions on Systems, Man, and Cybernetics
Eagleson's optimality theory of an ecohydrological equilibrium: quo vadis?
Functional Ecology
A practical guide to support vector classification