## 1 Introduction

Machine learning algorithms can be applied to ready-to-use data to obtain simple, self-standing forecasts, embedded in workflows and data processing pipelines, or even extended to artificial intelligence, where decisions are made by algorithms.

The hermetic nature of scientific communities may give the impression that ML methods are largely inaccessible to a wider audience. However, ML has great potential in non-big-data analysis, with its methods serving as supplements to spatial statistics and econometrics. The goal of this paper is to present a methodological overview of machine learning in the spatial context. Firstly, it outlines the nature of the information ML gives us and concludes whether ML is a substitute for or a complement to the traditional methods. Secondly, it presents two ways in which ML has been incorporated into spatial studies: by using typical ML on spatial data and by developing new ML methods dedicated to spatial data only. Thirdly, it aims to promote the application of ML in regional science. The paper concentrates solely on the following selected ML methods: unsupervised learning, which is closer to traditional statistics and encompasses clustering, and supervised learning, which is closer to econometrics and encompasses classification and regression.

A general overview of these methods is presented in “Appendix 1” and their R implementation in “Appendix 3”.

## 2 Statistical applications of machine learning in regional science


### 2.1 Clustering of points in space

Centroids of k-means clusters are artificial points (not necessarily present in the sample), located so as to minimise the distances between points within a cluster. For larger datasets, one applies the CLARA (clustering large applications) algorithm, which is the big data equivalent of PAM (partitioning around medoids). Both methods also apply distance metrics (such as the Euclidean one) but work iteratively in search of the best real representative point (medoid) for each cluster. In CLARA, the restrictive issue of the n × n distance matrix is resolved by drawing subsamples; PAM suffers from the same limitations as the k-means algorithm in this regard. The quality of clustering is typically assessed with silhouette or gap statistics (see “Appendix 1”). This mechanism can be applied to delineating catchment areas (e.g. for schools, post offices and supermarkets) or to dividing a market among sales representatives; in both instances, the challenge is to organise individual points around centres, possibly considering capacity and/or a fixed location of the centre. Aside from statistical grouping, clustering has huge potential for forecasting: a calibrated clustering model enables the automatic assignment of new points to established clusters. The prediction mechanism works on the basis of the k-nearest neighbours algorithm.
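As a minimal illustration of this pipeline (sketched in Python rather than the R implementation of “Appendix 3”), the fragment below implements plain k-means and the assignment of new points to the nearest established centroid; the data and the number of clusters are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: centroids are artificial points (not from the sample)
    placed to minimise within-cluster Euclidean distances."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

def assign(point, centroids):
    """Prediction step: a new point joins the cluster of its nearest centroid."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
```

The `assign` step mirrors the nearest-neighbour prediction mechanism described above: once the model is calibrated, new observations are routed to existing clusters without re-clustering.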

### 2.2 Clustering of features regardless of location

The local entropy can be written as \(H = - \sum\nolimits_{i = 1}^{i = 9} {p_{i} \ln p_{i} }\), where \(p_{i}\) is the relative population in the analysed cell and the eight neighbouring grid cells, and \(\sum\nolimits_{i = 1}^{i = 9} {p_{i} = 1}\). Clustering of entropy, when mapped, may delineate areas with high and low local density.
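This local-entropy measure can be sketched as follows (a hypothetical Python fragment; the grid layout and the restriction to interior cells are assumptions of the sketch):

```python
import math

def local_entropy(grid, r, c):
    """Shannon entropy of population shares p_i over a cell and its eight
    neighbours (3 x 3 window); the shares sum to 1 by construction.
    Assumes (r, c) is an interior cell of the grid."""
    window = [grid[i][j] for i in range(r - 1, r + 2)
                         for j in range(c - 1, c + 2)]
    total = sum(window)
    shares = [v / total for v in window if v > 0]  # 0 * log 0 treated as 0
    return -sum(p * math.log(p) for p in shares)
```

Entropy is maximal (ln 9) when population is spread evenly over the window and zero when it is concentrated in a single cell, so clustering these values separates evenly settled areas from concentrated ones.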

### 2.3 Clustering of locations and values simultaneously

Methods in this group use two dissimilarity matrices, \(D_{0}\) for values and \(D_{1}\) for locations, in order to increase the clusters’ spatial coherence. ClustGeo (CG), developed by Chavent et al. (2018), was extended to Bootstrap ClustGeo (BCG) by Distefano et al. (2020). The bootstrapping procedure generates many CG partitions, which are combined by means of the Hamming distance and dissimilarity measures (silhouette, Dunn, etc.) and used in CG to obtain the final partitioning, which minimises the inertia within clusters. The BCG approach outperforms CG, as shown by the dissimilarity measures. However, the algorithms are very demanding because of the dissimilarity matrix, which limits their applicability to big data.
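The core idea of mixing the two matrices can be illustrated with a simplified Python sketch (not the actual ClustGeo algorithm: the absolute-difference dissimilarities, max-normalisation and single-linkage merging below are assumptions of the sketch, whereas ClustGeo uses Ward-like criteria):

```python
import math

def mixed_dissimilarity(values, coords, alpha):
    """Convex combination of a feature dissimilarity matrix D0 and a
    geographic one D1: D = (1 - alpha) * D0 + alpha * D1.
    alpha tunes the spatial coherence of the resulting clusters."""
    n = len(values)
    d0 = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    d1 = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    # normalise each matrix by its maximum so the two scales are comparable
    m0 = max(max(row) for row in d0) or 1.0
    m1 = max(max(row) for row in d1) or 1.0
    return [[(1 - alpha) * d0[i][j] / m0 + alpha * d1[i][j] / m1
             for j in range(n)] for i in range(n)]

def single_linkage(d, k):
    """Tiny agglomerative (single-linkage) clustering on a dissimilarity matrix."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: min(d[i][j] for i in clusters[xy[0]] for j in clusters[xy[1]]))
        clusters[a] += clusters.pop(b)
    return clusters
```

With `alpha = 0` the grouping follows values only; with `alpha = 1` it follows locations only; intermediate values trade feature homogeneity for spatial coherence.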

### 2.4 Clustering of regression coefficients

### 2.5 Clustering based on density

there arose a group of methods based on the Voronoi/Dirichlet tessellation (Estivill-Castro and Lee 2002; Lui et al. 2008), called Autoclust. In the Voronoi diagram, the mean and standard deviation of the tile’s edge lengths are calculated for each point. In dense clusters, all edges are short; for border points, the variance of the edge lengths increases, as one edge is significantly longer than the others. Analysing the edges and border points delineates the borders of dense clusters. The biggest advantage is that the parameters (such as the number of clusters) are self-establishing, which is not the case with k-means or DBSCAN. This approach was also forgotten and did not become part of machine learning owing to the lack of a prediction mechanism. Recent proposals of 3D implementations (Kim and Cho 2019) suggest a revival of this method.
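The border-point criterion can be illustrated with a small Python fragment. The per-tile edge lengths are assumed to have been extracted beforehand (e.g. with `scipy.spatial.Voronoi`), and the threshold `c` is an illustrative assumption rather than the published Autoclust rule, which works with neighbourhood-level statistics:

```python
import statistics

def border_points(edge_lengths, c=1.5):
    """Flag points whose longest Voronoi-tile edge is much longer than the
    tile's mean edge length, i.e. candidate borders of dense clusters.
    edge_lengths: dict mapping point id -> list of edge lengths of its tile."""
    flagged = []
    for pid, edges in edge_lengths.items():
        m = statistics.mean(edges)
        s = statistics.pstdev(edges)
        # interior points of dense clusters have uniformly short edges (s ~ 0);
        # one markedly longer edge signals a border point
        if s > 0 and max(edges) > m + c * s:
            flagged.append(pid)
    return flagged
```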

### 2.6 Overview of ML spatial clustering

## 3 Econometric applications of machine learning to spatial data

| Type of model | Examples of usage | Thematic area | Remarks |
|---|---|---|---|
| Naïve Bayes | Park and Bae (2015) | Housing valuation | Did not perform best, similarly to C4.5 and AdaBoost; RIPPER performed much better |
| | Cracknell and Reading (2014) | Lithology classification | Did not perform best; random forest was better |
| k-Nearest neighbours | Cracknell and Reading (2014) | Lithology classification | Did not perform best; random forest was better |
| Random forest | Cracknell and Reading (2014) | Lithology classification | Performed best |
| | Meyer et al. (2019) | Land cover | Focus on selection of spatial variables and spatial CV; no other models in the study |
| | Behrens et al. (2018) | Soil | Focus on Euclidean distance fields; performed well, as did the bagged multivariate adaptive regression splines (MARS) it was compared with |
| | Ahn et al. (2020) | Soil | Focus on coordinates, distances and PCA-reduced distances as covariates; performed well |
| | Appelhans et al. (2015) | Temperature | Performed well |
| | | Poverty | Performed well, better than a regression tree |
| | Hengl et al. (2018) | Soil | Focus on buffer distance; performed well |
| | Goetz et al. (2015) | Landslide susceptibility | Performed as well as bootstrap-aggregated classification trees (bundling) with penalised discriminant analysis (BPLDA) |
| | Li et al. (2011) | Seabed mud | Focus on a mixture with kriging; performed well |
| | Xu and Li (2020) | Housing valuation | Focus on a stacking ensemble model; performed well |
| | Hengl et al. (2017) | Soil | Many spatial covariates, non-spatial CV, problems of high spatial clustering of sample points; the model predicts individual data which are later clustered for a composite prediction; performed well |
| | Pourghasemi et al. (2020) | Gully erosion | With many spatial covariates, performed better than LASSO, generalised linear model (GLM), stepwise generalised linear model (SGLM), elastic net (ENET), partial least squares (PLS), ridge regression, support vector machine (SVM), classification and regression trees (CART) and bagged CART; no spatial cross-validation applied |
| Support vector machines | Behrens et al. (2018) | Soil | Focus on radial basis function SVM and Euclidean distance fields; performed poorly |
| | Goetz et al. (2015) | Landslide susceptibility | Performed well |
| | Li et al. (2011) | Seabed mud | Focus on a mixture with kriging; did not perform best |
| | Du et al. (2020) | Land use | Strategic comparison of ML models; performed well |
| | Cracknell and Reading (2014) | Lithology classification | Did not perform best; random forest was better |
| Neural network | Behrens et al. (2018) | Soil | Focus on Euclidean distance fields; model-averaged neural network performed poorly |
| | Appelhans et al. (2015) | Temperature | Model-averaged neural network performed well |
| | Nicolis et al. (2020) | Seismic rate | Deep neural networks, long short-term memory (LSTM) and convolutional neural networks (CNN); performed well |
| | Masolele et al. (2021) | Land use | Deep neural networks in a spatio-temporal application; performed well; spatial or temporal structures can dominate depending on the dataset |
| XGBoost | Appelhans et al. (2015) | Temperature | Focus on stochastic gradient boosting; performed well |
| | Hengl et al. (2017) | Soil | Many spatial covariates, non-spatial CV, problems of high spatial clustering of sample points; the model predicts individual data which are later clustered for a composite prediction; performed well |
| | Xu and Li (2020) | Housing valuation | Focus on a stacking ensemble model using adaptive boosting, gradient boosting decision tree, light gradient boosting machine and extreme gradient boosting; performed well |
| Cubist | Behrens et al. (2018) | Soil | Focus on Euclidean distance fields; performed well |
| | Appelhans et al. (2015) | Temperature | Combined with residual kriging, performed well |

### 3.1 Simple regression models to answer spatial questions

Similarly, Liu et al. (2020a, b) ran a non-spatial regression and a random forest model on socio-economic and environmental variables to explain poverty in Yunyang, China, using data from 348 villages. The only computational spatial component was the Moran test of the residuals, which showed no evidence of spatial autocorrelation. The study was effective because it merged different sources of geo-projected data: surface data on elevation, slope, land cover types and natural disasters (with a spatial resolution of 30 m or 1:2000); point data, such as access to towns, markets, hospitals, banks, schools or industry, taken from POI (point-of-interest) or road density networks (on a scale of 1:120,000); and polygonal data on the labour force from a statistical office. Rodríguez‐Pérez et al. (2020) modelled lightning‐triggered fires in geo-located grid cells in Spain. They used RF, a generalised additive model (GAM) and spatial models to show that instances of lightning‐triggered fires in a given grid cell were attributable to observable features of that location, such as vegetation type and structure, terrain, climate and lightning characteristics. An applied example of statistical learning in the book by Lovelace et al. (2019) uses a generalised linear model on rastered landslide data (e.g. slope, elevation) together with point data of interest; spatial location and autocorrelation are included through spatial cross-validation.
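The Moran test of residuals used in such studies boils down to the Moran's I statistic; a minimal Python sketch follows (the toy values and the binary contiguity matrix in the usage are assumptions; applied work relies on dedicated packages such as `spdep` in R):

```python
def morans_i(z, w):
    """Moran's I of values z (e.g. regression residuals) given a spatial
    weight matrix w. Values near the expectation -1/(n-1) indicate no
    spatial autocorrelation; positive values indicate spatial clustering."""
    n = len(z)
    mean = sum(z) / n
    d = [v - mean for v in z]                    # deviations from the mean
    s0 = sum(sum(row) for row in w)              # sum of all weights
    num = sum(w[i][j] * d[i] * d[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in d)
    return (n / s0) * (num / den)
```

Applied to regression residuals, a value close to zero supports a "no spatial autocorrelation" conclusion of the kind reported by Liu et al. (2020a, b).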

### 3.2 Spatial cross-validation

### 3.3 Image recognition in spatial classification tasks

### 3.4 Mixtures of GWR and machine learning models

^{2} and a decrease in RMSE (root mean square error). Quiñones et al. (2021) applied the GWR concept with RF to analyse diabetes prevalence and showed that it detects spatial heterogeneity well.


### 3.5 Spatial variables in machine learning models


### 3.6 Overview of spatial ML regression and classification models

Both regression and classification are supervised learning algorithms that obtain a mutual relationship between the dependent (y) and explanatory (x) data. Regression models are used to explain usually continuous variables, while classification models are used for categorical variables. ML models for spatial data have mostly neglected the issue of spatial autocorrelation between observations. The latest studies, however, address this issue by including spatial variables among the covariates: geo-coordinates, distance to a given point (e.g. the core), mutual distances between observations, PCA-reduced mutual distances, buffer distances, spatial lags of variables, and eigenvectors or Euclidean distance fields. Addressing the spatial autocorrelation issue not only enables the training data to be reproduced well but also allows predictions to be made in new locations beyond the dataset (Meyer et al. 2019). GWR-like local machine learning regression bridges the gap between spatial and ML modelling. This stage results in sets of global or local regression coefficients or thresholds of decision trees.
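The covariate-augmentation step can be sketched in a few lines of Python (which feature is lagged, the core point and the weight handling are illustrative assumptions):

```python
import math

def add_spatial_covariates(X, coords, core=(0.0, 0.0), w=None):
    """Augment a feature matrix with spatial covariates of the kinds listed
    above: raw geo-coordinates, distance to a core point and the spatial lag
    (row-standardised neighbour average) of the first feature."""
    n = len(X)
    out = []
    for i, row in enumerate(X):
        extra = [coords[i][0], coords[i][1], math.dist(coords[i], core)]
        if w is not None:
            total = sum(w[i])
            lag = sum(w[i][j] * X[j][0] for j in range(n)) / total if total else 0.0
            extra.append(lag)
        out.append(list(row) + extra)
    return out
```

The augmented matrix can then be fed to any of the regression or classification models in the table above, which is how the reviewed studies inject spatial structure into otherwise aspatial learners.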