1 Introduction
1.1 Background
1.2 Challenges addressed in this work
-
High false positive and false negative error rates from the automated detection of urban land cover classes when compared to non-urban classes (e.g., bare rocks, sand dunes, bare agricultural fields, river bank lines) due to the limited actual extent of built-up areas and the discontinuous surface they compose [21];
-
The necessity to develop a model flexible enough to be applied to a global carpet of satellite data, which entails the design of a sound training approach, a strategy for transfer learning and a plan for verifying the consistency of the classification output;
-
The substantial amount of training data required for training complex models. In the case of built-up classification, the training samples should cover different building types (e.g., residential and industrial buildings of different sizes, colors and rotations) in various types of landscapes (e.g., dense urban areas, rural areas, desert landscapes, built-up areas mixed with neighborhood green spaces);
-
The increased need for computational processing resources, especially for adjusting and fine-tuning multiple and/or complex models;
-
The requirement for CNN architectures that are robust to noise in satellite imagery (e.g., presence of snow, clouds, haze) and to other seasonal effects. Such robustness would allow the models to generalize over large areas and to extract built-up areas with comparable efficacy along the urban–rural continuum.
-
A new framework for pixel-wise large-scale classification of built-up areas from a Sentinel-2 image composite at a spatial resolution of 10 m has been developed, named GHS-S2Net (GHS stands for Global Human Settlements, S2 refers to the Sentinel-2 satellite) (Sect. 2.3);
-
A multi-neuro modeling methodology is proposed following the Universal Transverse Mercator (UTM) grid zones schema and a systematic two-stage sampling within each UTM grid zone (Sect. 2.3.1);
-
Transfer learning is implemented following two separate approaches, depending on the availability of reliable training data in the different UTM zones: a close range transfer learning within each UTM grid zone and a far range transfer learning from one UTM grid zone to neighboring data-poor zones (Sect. 2.3.3). In this work, transfer learning does not follow the most common definition, which reuses the weights of models pre-trained on a different domain. As a concept here, it is closer to verifying the generalization capacity of the models when the training and testing data do not necessarily follow similar statistical distributions;
-
An extensive assessment of the model outputs, based on an independent validation using fine-scale digital cartographic reference data that report the footprint of every single building for 277 sites around the globe (Sect. 3.4).
2 Input data and methods
2.1 Sentinel-2 cloud-free image composite
2.2 Model input data: learning sets
2.2.1 Global Human Settlement Layer built-up areas
2.2.2 European Settlement Map
2.2.3 Facebook high-resolution settlement data
2.2.4 Microsoft building footprint data
Training set | Pixel size (m) | Coverage | Time stamp | Advantages | Constraints | BU samples resampled at 10 m (number of pixels) | BU samples resampled at 10 m (%) |
---|---|---|---|---|---|---|---|
GHSL_BU | 30 | Global | 2014 | Complete global coverage | Lower spatial resolution than the data under processing, thus including relatively higher error rates | 1.49E+09 | 28.29 |
ESM_BU | 2 | European | 2015 | High precision from very high resolution input data | Limited geographical coverage, large no-data zones over some cities | 5.31E+08 | 10.04 |
FB_HRS | ~ 30 | 194 countries | 2002–2017 | High precision derived by aggregation of very high resolution input data | Limited geographical availability, systematic false negatives in dense urban areas, sporadic false positives | 2.59E+09 | 49.06 |
MS_BFP | vector (rasterized at 1 m and aggregated to 10 m) | 4 countries | – | High precision with delineation of single buildings from very high resolution input data | Limited geographical availability, sporadic false negatives in industrial areas, sporadic false positives in specific landscapes (Canadian lakes, mountainous areas), unknown imagery date | 6.66E+08 | 12.61 |
2.3 GHS-S2Net building blocks
-
Firstly, given that the target to be recognized ranges in size from single residences to blocks of contiguous buildings, the model capacity should allow the collection and distillation of the fine information provided by either single pixels or small groups of pixels with homogeneous characteristics. Unlike popular tasks in natural image segmentation and object localization, where there are sizeable image regions with common characteristics (color, texture, connectivity, etc.), the size of the objects to be recognized here varies from 10 m (the finest resolution, associated with a single pixel) to some dozens of meters. Consequently, the contextual information that surrounds one pixel and accommodates the prominent features can be expressed by narrow image windows (patches) a few pixels in size. Extensive experimentation on the optimal patch size for efficient convolution, specifically for Sentinel-2 imagery, is presented in [71]. In the present study, an image patch of size 5 × 5 has been selected as the input image to the CNN, and the image is convolved through successive kernels of size 2 × 2 with stride 1 × 1. At this narrow representation, and to avoid losing essential information, no pooling layers have been employed to further reduce the spatial size.
-
Secondly, the motivation was to design a lightweight model that could adequately serve the chosen multi-modeling approach and allow several degrees of flexibility in terms of distributed computing. The total number of model parameters is 1,448,578 (1,447,042 trainable and 1,536 non-trainable), 95 times fewer than VGGNet [72] and 2.7 times fewer than GoogleNet [73] (indicative CNNs). While the number of 2D convolutional layers is limited to 4 and the number of flattened layers to 2, the number of parameters has been increased due to the high number of filters. Tests showed that this CNN topology can perform quite well even with fewer filters, yet we decided to keep the number of filters high so that the model can capture very subtle details. This lightweight topology facilitates the execution of the algorithm across heterogeneous GPU modules throughout the prototyping and operational phases. Additionally, it smoothly enables the multi-model deployment in which a different model is trained for every UTM zone, capturing more precisely the local characteristics and the variance along similar geographical regions.
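The effect of stacking 2 × 2 kernels with stride 1 and no pooling on a 5 × 5 patch can be illustrated with a short calculation. The sketch below is not the GHS-S2Net implementation: it assumes unpadded ("valid") convolutions, which the text does not state, and the function names are illustrative. Under that assumption, four convolutional layers reduce the patch to a single spatial position. The last lines simply check that the reported trainable and non-trainable parameter counts add up to the reported total.

```python
def conv_output_size(size: int, kernel: int = 2, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after one convolution (standard formula; padding=0 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def spatial_sizes(patch: int = 5, n_layers: int = 4, kernel: int = 2, stride: int = 1) -> list:
    """Spatial size of the feature maps after each successive convolution."""
    sizes = [patch]
    for _ in range(n_layers):
        sizes.append(conv_output_size(sizes[-1], kernel, stride))
    return sizes


# 5 x 5 input patch, four 2 x 2 convolutions with stride 1, no pooling.
print(spatial_sizes())  # [5, 4, 3, 2, 1]

# Parameter counts as reported in the text.
trainable, non_trainable = 1_447_042, 1_536
print(trainable + non_trainable)  # 1448578
```

With such a small input, each convolution removes only one pixel per dimension, so a pooling layer would discard most of the remaining spatial information after a single step.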
2.3.1 Two-stage training approach
2.3.2 Per-tile predictions
2.3.3 Close range and far range transfer learning
-
instance-based transfer which uses partial training samples in the source domain to improve the performance of the model of the target domain [79];
-
feature representation-based transfer [80] which assists the target domain classifier to learn a more effective feature expression from the source domain and improve its performance;
-
relational knowledge transfer [81] where knowledge among the data in the source domain is transferred to the target domain;
-
parameter-based transfer [82], which assumes that the source domain classifier and the target domain classifier share the same optimal parameters, so that the parameters found for the source domain classifier can be used for the target domain classifier.
-
The close range transfer learning consists in training the model with a subset of the input data in a given UTM grid zone (following the method described in Sect. 2.3.1) and applying it to all the 100 × 100 km² tiles falling within the same UTM grid zone. This approach speeds up the training of 485 different models and the production of predictions for a total of 30,000 tiles. It also helps overcome overfitting issues;
-
The far range transfer learning consists in training the model with detailed samples such as MS_BFP and FB_HRS in a given UTM grid zone and applying it to a neighboring zone, or to zones with similar landscape and built-up typology, where labeled samples are scarce or where only GHSL_BU training data are available. This approach allows refining the predictions and testing the generalization capabilities of the GHS-S2Net model.
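The two strategies can be read as a rule for choosing which trained model predicts each tile. The following is a minimal sketch under stated assumptions, not the procedure used in this work: the `Zone` structure, the `donor` field and the zone and tile labels are hypothetical, and the choice of a donor zone (by adjacency or landscape similarity) is taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Zone:
    name: str                     # UTM grid zone label (illustrative values below)
    has_detailed_labels: bool     # fine-scale samples such as MS_BFP or FB_HRS exist
    donor: Optional[str] = None   # hypothetical: neighboring or similar zone to borrow from


def model_for_zone(zone: Zone, zones: Dict[str, Zone]) -> str:
    """Return the name of the zone whose trained model predicts this zone's tiles."""
    if zone.has_detailed_labels or zone.donor is None:
        return zone.name          # close range: the zone's own model
    donor = zones[zone.donor]
    if not donor.has_detailed_labels:
        raise ValueError(f"donor zone {donor.name} has no detailed training samples")
    return donor.name             # far range: model borrowed from the donor zone


def assign_models(tile_to_zone: Dict[str, str], zones: Dict[str, Zone]) -> Dict[str, str]:
    """Map every tile identifier to the zone model that should classify it."""
    return {tile: model_for_zone(zones[z], zones) for tile, z in tile_to_zone.items()}


zones = {
    "Z1": Zone("Z1", has_detailed_labels=True),
    "Z2": Zone("Z2", has_detailed_labels=False, donor="Z1"),
    "Z3": Zone("Z3", has_detailed_labels=False),
}
tiles = {"tile_a": "Z1", "tile_b": "Z2", "tile_c": "Z3"}
print(assign_models(tiles, zones))
# {'tile_a': 'Z1', 'tile_b': 'Z1', 'tile_c': 'Z3'}
```

In this toy setup, a data-poor zone without a suitable donor falls back to its own model, for instance one trained only on GHSL_BU samples.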
2.4 Processing infrastructure
3 Results
3.1 Training phase of CNN models per UTM grid zone
3.1.1 Hyper-parameters tuning
3.1.2 Performance evaluation
3.2 Computational performance of the GHS-S2Net models during the training and prediction phases
3.3 Qualitative assessment of the models predictions
3.4 Validation of the model predictions and assessment of generalization performance
-
Continuous assessment: by testing the GHS-S2Net output as a predictor of the built-up densities at the spatial resolution of 10 m through least-squares linear regression;
-
Binary assessment: by evaluating the contingency table between the binarized outputs of GHS-S2Net, after the application of a probability cut-off value, and the binarized reference data used as "ground truth."
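Both assessments reduce to a few lines of arithmetic. The sketch below uses toy values rather than the validation data, and makes two assumptions that the text does not specify: a pixel is labeled built-up when its probability is greater than or equal to the cut-off, and balanced accuracy is taken in its usual sense as the mean of the true positive rate and the true negative rate.

```python
def contingency(prob, ref, cutoff):
    """Return (tp, fp, fn, tn) after binarizing `prob` at `cutoff` (>= is an assumption)."""
    tp = fp = fn = tn = 0
    for p, r in zip(prob, ref):
        predicted = p >= cutoff
        if predicted and r:
            tp += 1
        elif predicted:
            fp += 1
        elif r:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def overall_accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def balanced_accuracy(tp, fp, fn, tn):
    """Mean of sensitivity and specificity; both classes must occur in the reference."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept; returns (slope, intercept, r2).

    Requires non-constant x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy model probabilities and binary reference labels (illustrative only).
prob = [0.10, 0.30, 0.60, 0.90, 0.25, 0.05]
ref = [0, 0, 1, 1, 1, 0]
for cutoff in (0.2, 0.5):
    table = contingency(prob, ref, cutoff)
    print(cutoff, table, overall_accuracy(*table), balanced_accuracy(*table))
```

Lowering the cut-off trades omission errors for commission errors, which is why both overall and balanced accuracy are worth reporting at more than one cut-off.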
3.4.1 Continuous assessment: validation of the model output as predictor of built-up densities
3.4.2 Binary accuracy assessment
3.5 Comparison between the results of close range and far range transfer learning
Transfer learning approach | Overall accuracy (0.2 cut-off) | Overall accuracy (0.5 cut-off) | Balanced accuracy (0.2 cut-off) | Balanced accuracy (0.5 cut-off) |
---|---|---|---|---|
Close range transfer learning | 0.61 | 0.67 | 0.75 | 0.76 |
Far range transfer learning | 0.77 | 0.83 | 0.81 | 0.78 |
4 Discussion and future work
-
The multi-neuro modeling methodology, which follows the UTM grid zones schema and the systematic sampling within each UTM grid zone. This approach of training multiple lightweight models at global scale decomposes the optimization phase into smaller tasks, which are then solved in parallel. The adopted sampling approach meets three criteria: class balance, diversity, and representativeness. It proved suitable for optimal learning of the models at a global scale without compromising performance;
-
Transfer learning, which includes both the close range and the far range approaches. Both benefit from parameter-based transfer methods, where the optimal parameters found for the source domain classifier are used for the target domain. The novelty of the approach implemented in the paper is the use of close range transfer learning within the same UTM grid zone to alleviate the computational burden and avoid overfitting issues. The far range transfer learning leverages the optimal parameters found when training the models with detailed and high-quality training sets in a given UTM grid zone and applies them to neighboring zones where training data are scarce. It mitigated the scarcity and quality issues in the training sets while achieving outstanding performance in reducing the commission and omission errors found in the best available data and in refining the detection of built-up areas;
-
The deployment of the high-throughput processing, including data preparation, learning and inference, on the multi-petabyte scale JEODPP platform. The big data multi-GPU platform enables: (i) the efficient storage of the large volume of input satellite data (15 TB) and of the output maps (1.5 TB) encoded in 16 bits, (ii) the parallel training of the models on a heterogeneous cluster of GPUs, and (iii) the optimal load balance in terms of data retrieval and processing from and to the distributed system, due to the efficient co-location of the data with the processing units.
-
The choice of patch size: in general, assessments of CNN accuracy indicate that larger patch sizes yield higher accuracies because the network is able to learn more contextual features. In the case of Sentinel-2 pixel-based classification, the experiments performed in [71] showed that larger patch sizes (e.g., 15 × 15) did not yield a significant improvement in model accuracy. In this work, we tested a 10 × 10 patch size, which resulted in a deeper network topology, yet the loss function did not improve during the training phase and the prediction accuracy worsened.
-
The far range transfer learning: the strategy for implementing the far range transfer learning was based on criteria related to the spatial adjacency of UTM grid zones or to similarities in the landscape and in the type of built-up areas. The potential of this approach for mitigating problems in the training data and for deriving fine-grained classification outputs was clearly demonstrated in the classification results. Nevertheless, the added value of this approach was not fully exploited in the context of this work. Additional work should focus on the analysis of spatial patterns of landscape features and typologies of built-up areas and their influence on the outputs of the classification with GHS-S2Net. The ultimate goal is to unveil the underlying rules and associations needed to design a more systematic approach for identifying the source and target UTM grid zones that are candidates for the far range transfer learning.
-
The variable quality of the training data: despite the outstanding learning capability of CNN models, the lack of accurate training data might limit their applicability in realistic remote-sensing contexts [88]. For our global scale application, the strategy was to collect the best publicly available training data reporting on built-up areas. The higher the spatial resolution of the training data, the more detailed the output of the classification. Ideally, the spatial resolution of the input training data should be equal to or better than that of the input Sentinel-2 imagery. As described in Sect. 2.2, the reference data sources have variable spatial resolutions. In addition, the trustworthiness of samples is highly variable across the different sources but also within the same reference data source. The lack of consistency in the training data produces outputs of variable quality, depending on the input data used for training the models. This was reflected in the results of the validation when disaggregated per continent. One approach to deal with imperfect training data was to use the far range transfer learning. However, this approach has limited applicability at global scale, since it supposes that the target UTM grid zones have characteristics (in terms of landscape and types of built-up areas) similar to those of the source zones. Another approach is to use a two-step training approach in which the models are first initialized using a large amount of possibly inaccurate reference data and then refined on a small amount of accurately labeled data, similarly to the method developed in Maggiori et al. [88]. In the context of our large-scale classification, it is perfectly reasonable to use the output produced by the GHS-S2Net to train a new model.
The use of high-quality and consistent outputs produced for the reference year 2018 by the application of the GHS-S2Net model at global scale is key to frequent updates of built-up layers from Sentinel-2 Copernicus data and to the continuous monitoring of built-up areas.
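The two-step training idea mentioned above (initialization on abundant but possibly inaccurate labels, followed by refinement on a small accurate set) can be illustrated on a toy problem. The sketch below deliberately replaces the CNN with a one-feature logistic model trained by full-batch gradient descent. The data, thresholds, learning rate and epoch counts are invented for illustration and do not come from this work or from [88].

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Mean binary cross-entropy of the model sigmoid(w * x + b)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def train(xs, ys, w=0.0, b=0.0, lr=0.5, epochs=200):
    """Full-batch gradient descent, starting from the given (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Step 1: large, possibly inaccurate set (toy: the label boundary is misplaced at 0.35).
noisy_x = [i / 100 for i in range(100)]
noisy_y = [1 if x > 0.35 else 0 for x in noisy_x]

# Step 2: small, accurately labeled set (toy: the true boundary lies near 0.5).
clean_x = [0.10, 0.30, 0.45, 0.55, 0.70, 0.90]
clean_y = [0, 0, 0, 1, 1, 1]

w0, b0 = train(noisy_x, noisy_y)                         # initialization
w1, b1 = train(clean_x, clean_y, w=w0, b=b0, epochs=50)  # refinement from (w0, b0)

print(log_loss(w0, b0, clean_x, clean_y), log_loss(w1, b1, clean_x, clean_y))
```

The refinement step starts from the parameters learned on the noisy set instead of from scratch, so the small accurate set only has to correct the decision boundary, not learn it entirely.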