Spatial predictive modeling of prehistoric sites in the Bohemian-Moravian Highlands based on graph similarity analysis

Adam Mertel; Peter Ondrejka; Klára Šabatová

doi:10.1515/geo-2018-0020

Open Access Published by De Gruyter Open Access July 20, 2018

From the journal Open Geosciences

https://doi.org/10.1515/geo-2018-0020

Abstract

This paper presents a new method for identifying promising areas for archaeological research. The method is based on graph analysis that iteratively compares and manipulates Hamming distances between graphs of input geographical parameters and graphs of human activity data in various historical periods. The weights learned from the comparison were used to build a prediction model that estimates the potential presence of an archaeological site of a certain time period in a given cadaster. The method was applied in the Bohemian-Moravian Highlands region using the most complete archaeological dataset of the area. The resulting maps were analyzed from the archaeological and historical points of view and tested against the existing knowledge of prehistoric population movement in the region. Overall, the method proved able to overcome problems such as fragmentary inputs and is a good candidate for application in smaller and geographically diverse research areas. The aim of this work was to contribute to the methodology of predicting historical human activity, to facilitate greater comprehension of past local settlement dynamics, and possibly to ease the protection of cultural heritage.

Keywords: graph analysis; machine learning; spatial modeling; archaeological predictive modeling; historical landscape

1 Introduction

Archaeological sites provide an invaluable source of information on past cultures. Yet, due to present rapid changes of the landscape caused by human activities, these unique sources are often destroyed or compromised before excavation. Locating areas with high archaeological potential is therefore an important topic in current archaeology, as it can help to identify and preserve sites for analysis in their original context. Also, a working predictive model can be used as an efficient solution to the lack of funding by minimizing the number of excavations [1].

Research tools in archaeology have been enriched in recent years. Data from sources such as satellite imagery, laser scanning or UAVs (drones) can now be analysed with the help of the increasing computational power of current information technologies (see for example [2]). The introduction of new methods of modeling and analysis goes hand in hand with this trend.

Researchers seek ways to combine historical data from already excavated sites with environmental characteristics of landscape to assess the potential for finding new archaeological sites. A set of various methods used for this task is often labeled as archaeological predictive modeling (APM). Also, geographic information systems (GIS) have already become a common part of research practice, and there are other methods, particularly in the field of data mining and machine learning, that have yet to prove their utility in archaeology.

One research area that could contribute interesting methods to APM is graph similarity analysis. The historical and environmental relationships between research sites can be represented as non-directed graphs, which opens the door to a wide range of graph-related analytical tools that can be applied to archaeological data. This paper aims to explore the potential of graph similarity methods for the analysis and modelling of past human activities.

One problem in current archaeological models that could be addressed by graph similarity methods is the calculation of the relative influence of environmental parameters on historical settlement. There is an extensive body of research dealing with connections between archaeological and environmental data; however, in current models the weight of individual environmental variables is either set arbitrarily or relies on expert assumptions external to the input data. Graph similarity methods provide a way to estimate the weights directly from the input data.

We extracted input data from the database of archaeological sites created as part of the project Historical Landscape of Bohemian-Moravian Highlands [3]. The amount of input data and the spatial and temporal granularity of the resulting models are unprecedented in studies of the region of interest. Our proposed method consists of three parts:

Association of archaeological data from various periods with environmental data in given cadasters
Iterative graph comparison. We apply the Hamming distance method, taken from graph theory, to find the relationship between the existence of human activity in historical eras and the environmental conditions at the site
The observed relationship (expressed by learned weights) is then used to estimate the potential of finding a chronological component in cadasters where an archaeological site has not yet been established. The estimation is rigorous in both space and time, as it allows the occurrence probability to be modeled for individual historical eras

The modeling method was tested for its relevance in archaeological context at the larger scale of the Bohemian-Moravian Highlands, where it proved to reasonably describe differences between past human activity patterns in selected subsequent prehistoric eras. However, the method can be used to estimate archaeological potential in much smaller research areas for more precise locality prediction.

2 Related work

The topic of APM has been approached differently by various authors, resulting in a considerably large and varied set of methods being associated with APM. There are also various definitions of APM (see [4, 5] for a comprehensive overview), but in essence it is “a tool that indicates the likelihood of cultural material being present at a location” [6, 7]. Development of APMs began decades ago [8]. It is an inductive method derived from observations rather than from theory, though some APM models may be formed from a successful but informal hypothesis and later examined under simple statistical procedures [9]. Among statistical methods, logistic regression modeling emerged as a classical tool for APM [10] and has been revisited in recent works [11]. Classical regression models are still being used, though with model testing improved by using randomly generated non-sites for verification [4].

Recognizing the inherent spatial aspect of the discipline, archaeologists have been increasingly using GIS in their research practices. This phenomenon was mapped by several authors from historical, conceptual and methodological standpoints [12, 13, 14, 15]. GIS tools have proven useful especially for preparing and combining input layers for APM. What differs among authors is the nature of the spatial data and methods leveraged in their prediction models. Some propose an enhancement of APMs based on feature extraction and classification of multi-spectral remote sensing images [16]; others came up with an alternative evidence density estimation (EDE) function model [17]. Though APM is usually based on combining several environmental parameters, sometimes a single aspect is considered. For example, some authors focus on erosion, as they see it as a force that can help to uncover archaeological sites [18]. Such a finding is, however, rooted in specific geographic conditions that are usually not present in the Czech Republic, where our study area is situated.

What also differs is how researchers delineate the area of environmental influence on the archaeological site. One possibility is to use uniform buffer zones around sites, as in [1]; an alternative is to employ administrative units for computing averages of environmental variables. The advantage of the latter approach is that the administrative units can be joined with data from other sources, though the units need to be small enough to avoid disguising the spatial pattern. Furthermore, the spatial data used in archaeology always come with a certain error in location accuracy. While many authors treat this as a given fact, some explore fuzzy classification methods for mitigating the issue [19], while others use a classification and regression tree (CART) algorithm to incorporate the inherent uncertainty into their model [20]. A fuzzy logic system has also been used to address the problem of a study area with low relief changes [21].

A common problem in APM is determining the weights to be assigned to input spatial layers. One way is to classify spatial layers beforehand into four categories: layers of exclusive influence (for example water), layers of general influence, attractors and deflectors [22]. The disadvantages of such an approach include a lack of statistical significance and often unrealistic assumptions of environmental uniformity of study areas [22]. However, these problems can be sufficiently tempered by combining a large number of known sites with relatively small areas of spatial units. Other options include the MaxEnt [23] or ECNM [24] models.

APM finds application both in cultural heritage management [25] and as a means of understanding settlement-patterning causation [12]. The models are often fine-tuned to the environmental specifics of a given study area, hence the large number of case studies that have been conducted [1, 4, 7, 26, 27].

3 Study area, input dataset and data preprocessing

The input archaeological database contains information about 4153 archaeological sites (as of September 2017) in southwestern Moravia and eastern Bohemia. The database was created as part of the project Historical Landscape of Bohemian-Moravian Highlands (the project is hosted at https://naki.phil.muni.cz/). It is currently the most comprehensive dataset for the region, although no archaeological dataset can ever be considered complete. The main area the project concentrated on was the Vysočina Region, which we extended by 30 cadasters from the former district of Znojmo to cover the old agricultural landscape (Figure 1). The aim of the project was to record information about archaeological sites in this region from prehistory to the Late Middle Ages in order to facilitate their protection as cultural heritage. As part of the project, software tools were developed to analyse the spatio-temporal relations of archaeological sites and to export derived text or visualization reports.

Figure 1

The study area with research sites used in analysis.

The geographic extent of further analyses in this paper was defined by the focus of this project. It should be noted that the area is not completely homogeneous in terms of environmental conditions. The southernmost part is flat and has good-quality soil, while the north was extensively forested and not easily accessible. This is also the reason why most of the archaeological sites are located in the southeastern part.

Each archaeological site recorded in the database is associated with a large variety of attributes specified for each of the 65 typified components (historical eras). For this work we used only a small subset of these data, namely:

geographical location represented by the coordinate pair of the central point of the archaeological site.
chronological determination - historical eras in which the site existed

In this paper, a site is equivalent to an archaeological component. “A component is a set of artifacts and complexes at a site, coming from one period and serving one purpose” [28, 29]. Archaeological sites in the project Historical Landscape of Bohemian-Moravian Highlands were divided into “functional components” (residential components, mortuary components, partial components and hoards, etc.) and “chronological components” (Early Neolithic, Middle Bronze Age, etc.; see Table 1). For the purposes of this prediction, however, all archaeological sites were considered possible traces of human activity. Archaeological sites without valid spatial information or outside of our study area were filtered out, as were those with a less clear identification of the archaeological component (for example, assessed by palynological or dendrochronological methods). Also, for further analysis, only chronological components up to the Early Middle Ages were chosen. This left us with a subset of 1305 sites. Because of the hierarchical structure of the archaeological components, the dataset had to be adjusted so that every site with a specific functional or chronological component also contained its parent components. Afterwards, sites with more than one archaeological component were cloned so that the components could be redistributed among these copies. At the end of this procedure, the dataset consisted of 2272 cleaned archaeological components.

Table 1

Chronological components selected in the model.

Prehistoric and historical periods in sequence	Absolute dates	Chronological components selected in model	Absolute dates
Mesolithic	8000 – 5600 BC	Mesolithic	8000 – 5600 BC
Neolithic	5600 – 4500 BC	Middle Neolithic	4900 – 4600 BC
Eneolithic	4500 – 2100 BC	Final Eneolithic	2900 – 2100 BC
Bronze Age	2100 – 800 BC	Final Bronze Age	1000 – 800 BC
Iron Age	800 – 50 BC
Roman and Migration Period	50 BC – 550 AD
Early Middle Ages	550 – 1250 AD	9th century	800 – 900 AD
		12th century	1100 – 1200 AD

To study the relation between the physical predispositions of the terrain and human activity in historical eras, we chose cadasters (administrative spatial units) as spatial division units. This is simply because cadasters are the most fine-grained layer attainable from official sources [30]. In our region they also tend to reflect the expansion of human activity, as the majority of the settlement structure has persisted into the modern age. Also, compared to an artificial structure such as a square grid, cadaster borders tend to better reflect the natural borders in the terrain. At the time of writing, 1437 cadasters existed (their mean area was 5.4 square kilometers).

Based on the referenced literature [31, 32, 33, 34, 35], we selected the following variables: elevation, slope, topographic wetness index (TWI), topographic position index (TPI), closeness to a water stream and quality of soil. To calculate or estimate these variables for each cadaster, the following datasets were used:

SRTM (Shuttle Radar Topography Mission) DEM obtained from [36]
The volume of coarse fragments in soil in 30 cm depth obtained from [37]

4 Methods

The work was divided into three parts:

Pre-calculations – calculating static geographical variables and assigning archaeological sites to cadasters
Graph analysis – comparison of distance matrices and searching weights for static geographical variables
Prediction model – calculating probabilities of site occurrences for cadasters

4.1 Pre-calculations

Within the first step of the analysis we calculated the static geographical variables mentioned for every cadaster:

Elevation – average value of altitude
Slope – average slope in degrees derived from elevation
Access to water – first we modeled the stream network from the terrain, then we created a buffer of 500 meters around it and calculated the proportion of each cadaster's area that lies within this buffer. We derived the buffer distance from the literature [19]; it represents an approximate walking time of 15 minutes for the journey there and back [11]
Topographic wetness index (TWI) – based on the formula ln(a / tan B), where a is the water accumulation (from the modeled water network) and B is the slope value [38]
Topographic position index (TPI) – as a proxy for TPI, we used the slope variability within a 300 m square area [39]
Coarse Fragments– the average volume of coarse fragments in soil was used to represent soil quality in each cadaster
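The TWI formula from the list above can be illustrated with a minimal Python sketch. The function name `twi`, the use of degrees for the slope, and the small slope floor that avoids division by zero on flat cells are assumptions made for illustration; the paper does not state how flat terrain was handled.

```python
import math

def twi(accumulation, slope_deg, min_slope_deg=0.1):
    """Topographic wetness index ln(a / tan(B)).

    accumulation  -- water accumulation value a (must be positive)
    slope_deg     -- slope B in degrees
    min_slope_deg -- assumed floor so that tan(B) is never zero on flat cells
    """
    slope_rad = math.radians(max(slope_deg, min_slope_deg))
    return math.log(accumulation / math.tan(slope_rad))

# With equal accumulation, a gentler slope gives a higher index.
print(twi(500.0, 2.0) > twi(500.0, 20.0))
```

In practice the index would be computed per raster cell and then averaged per cadaster, in the same way as the other variables.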

To study the sensitivity of the input data, we calculated basic statistics for each of the variables. We also calculated Pearson correlation coefficients and created overview maps to show the spatial distribution (Figure 2).

Figure 2

Statistical exploration of geographic variables

As a next step, we standardized the values of the geographical variables. We used the z-score method, as it works better with skewed distributions of input data. Normalisation was not used because we wanted to preserve the original distribution, and our method does not require it.
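As a minimal sketch of the z-score standardization described above, the following Python function rescales one variable to zero mean and unit standard deviation. The use of the population standard deviation and the handling of a constant variable are assumptions; the paper does not specify either.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize a list of numbers: (value - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (an assumption)
    if s == 0:
        # A perfectly uniform variable carries no information.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]

# Hypothetical mean elevations (in meters) of three cadasters.
print(zscore([420.0, 510.0, 600.0]))
```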

To bind all cleaned and filtered archaeological components to cadasters (see section 3 for details on the cleaning procedure), we used a QGIS (version 2.14) script that automated the built-in Points in Polygon method. Every cadaster containing at least one site was then tagged and associated with a list of archaeological components from the corresponding sites. In the following step, we calculated mean values of the geographical variables for every archaeological component (see Figure 3 for a simplified illustration of the procedure).

Figure 3

The process of data extraction and binding in the first step of the analysis. Note that the calculation of average values per cadaster is repeated for each geographical variable separately. Therefore, the outputs of this process are: the table of cadasters with associated archaeological components, and the table of archaeological components with associated average values of geographic variables (elevation, slope, etc.).
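The binding step can be sketched outside QGIS with a plain ray-casting point-in-polygon test. This is an illustration only, not the authors' QGIS script: the function names, the data layout, and the reading that a component mean is taken over the cadasters containing that component are all assumptions.

```python
from collections import defaultdict

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def bind_components(sites, cadasters):
    """Map cadaster id -> set of components found in it.

    sites     -- list of (x, y, component) tuples
    cadasters -- dict of cadaster id -> polygon
    """
    bound = defaultdict(set)
    for x, y, component in sites:
        for cad_id, polygon in cadasters.items():
            if point_in_polygon(x, y, polygon):
                bound[cad_id].add(component)
                break  # a site belongs to at most one cadaster
    return dict(bound)

def component_means(bound, cadaster_values):
    """Mean of one geographical variable per component."""
    collected = defaultdict(list)
    for cad_id, components in bound.items():
        for component in components:
            collected[component].append(cadaster_values[cad_id])
    return {c: sum(v) / len(v) for c, v in collected.items()}
```

As the figure caption notes, `component_means` would be called once per geographical variable.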

For further analysis we kept only cadasters that contained at least one component (a condition of the model). We then divided these cadasters into two parts. The first one, consisting of 113 randomly selected cadasters, was used as a testing subset. The remaining 225 cadasters were used for training.
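The random split can be reproduced with a few lines of Python. The function name and the fixed seed are hypothetical; the paper does not say how the random selection was drawn.

```python
import random

def split_cadasters(cadaster_ids, n_test, seed=0):
    """Randomly split cadaster ids into (training, testing) lists."""
    ids = sorted(cadaster_ids)   # fixed starting order for reproducibility
    rng = random.Random(seed)    # seed value is an arbitrary assumption
    rng.shuffle(ids)
    return ids[n_test:], ids[:n_test]

# 338 cadasters with at least one component: 113 for testing, 225 for training.
training, testing = split_cadasters(range(338), 113)
print(len(training), len(testing))
```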

4.2 Graph analysis

The goal of this part of the analysis was to find appropriate weights for all geographical variables to estimate how much they participated in the overall variability of particular archaeological components. See Figure 4 for a graphic depiction of the workflow.

Figure 4

The process of graph comparison and iterative weighting used to obtain weights in the second step of analysis.

From the filtered training subset (see chapter 4.1) we created two tables:

The first table contained a list of cadasters and the archaeological components of the sites located in them
The second table was the list of the same cadasters but with the standardised (z-score method) values of the static geographical variables

The above inputs served as a basis for creating a set of distance matrices representing how different each cadaster is from the other cadasters. For the calculation we used the dist function implemented in R (stats package [40]), which provides several algorithms for calculating distances. The first distance matrix, which represented differences based on the presence of archaeological components, was calculated using the Canberra algorithm, which omits zero values from the calculation and is therefore less sensitive to data absence. The remaining matrices were calculated from the geographical variables using Euclidean distances.
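The two kinds of distance matrix can be sketched in Python as follows. This is a simplified stand-in for the R `dist` function, not a reimplementation: here terms where both values are zero are simply skipped, whereas R treats them as missing values and may rescale the sum accordingly. All function names are chosen for illustration.

```python
import math

def canberra(u, v):
    """Canberra distance; terms where both values are zero are skipped."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def euclidean(u, v):
    """Ordinary Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(rows, metric):
    """Square matrix of pairwise distances between rows (one row per cadaster)."""
    n = len(rows)
    return [[metric(rows[i], rows[j]) for j in range(n)] for i in range(n)]

# Hypothetical presence/absence of three components in three cadasters.
presence = [[1, 0, 1], [1, 0, 0], [0, 0, 0]]
print(distance_matrix(presence, canberra))
```

The second column of `presence` is zero everywhere, so it contributes nothing to any pairwise distance, which is the property that makes the measure less sensitive to absent data.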

A non-directed graph is one possible way to reimagine a distance matrix as a spatial object. The distance matrices were turned into graphs placed in an n-dimensional space, where the number of dimensions was given by the number of cadasters. In other words, every cadaster was represented by a node in a graph, and the coordinates of the node were given by the values in the distance matrix (i.e. differences from other cadasters). Hypothetically, if some geographical variable had been perfectly uniform across the study area, it would have been represented by an all-zero distance matrix that would turn into a graph with all nodes concentrated in one point.

The next step in our process was based on the assumption that if a geographical variable strongly influenced the distribution of archaeological sites, it would produce a graph very similar to the graph of archaeological components. The more similar the graphs are, the better the archaeological variability of sites is explained by their geographical variability.

To assess the similarity of the graphs we used the Hamming distance method. The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other [41]. The concept, introduced by Hamming in 1950 [42], is used for comparing signals in information theory. The Hamming distance between graphs is the minimal sum of Manhattan distances that the nodes of one graph would have to travel to become equal to the other graph. Note that by using distance matrices with the same number of elements we fulfill the condition of comparing equal-length strings.
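Both notions of Hamming distance mentioned above can be illustrated with a short Python sketch. The matrix version assumes that the distance between two valued graphs is taken as the sum of absolute element-wise differences between their equally sized matrices; this is one common reading and is stated here as an assumption, not as the exact behaviour of the software used in the paper.

```python
def hamming_strings(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))

def hamming_matrices(m1, m2):
    """Sum of absolute element-wise differences between two matrices."""
    if len(m1) != len(m2) or any(len(r1) != len(r2) for r1, r2 in zip(m1, m2)):
        raise ValueError("matrices must have the same shape")
    return sum(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

print(hamming_strings("karolin", "kathrin"))
print(hamming_matrices([[0, 1], [1, 0]], [[0, 3], [3, 0]]))
```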

We designed a script to find weights for the geographical variables that would minimize the respective Hamming distances. The input geographical variables were iteratively multiplied by different coefficients, and the distance matrices were re-calculated, leading to new graphs, to a new comparison with the graph of archaeological components, and to re-calculated Hamming distances. An increase or decrease of the Hamming distances then showed how much the weights helped to make the graphs of geographical variables more similar to the archaeological components graph. This information was used in the next iteration to adjust the weights in the correct direction, and the procedure was repeated until the set number of iterations was reached.
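The iterative weighting loop can be sketched as a simple hill-climbing search. This is not the authors' R script: combining all variables into one weighted Euclidean distance matrix, the multiplicative step, the step halving, and every function name are assumptions chosen to keep the example short and self-contained.

```python
import math

def weighted_distance_matrix(rows, weights):
    """Pairwise weighted Euclidean distances between cadaster rows."""
    n = len(rows)
    return [[math.sqrt(sum(w * (a - b) ** 2
                           for w, a, b in zip(weights, rows[i], rows[j])))
             for j in range(n)] for i in range(n)]

def matrix_gap(m1, m2):
    """Sum of absolute element-wise differences (assumed graph distance)."""
    return sum(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

def fit_weights(geo_rows, target_matrix, iterations=200, step=0.1):
    """Search for weights that bring the geographical matrix closer to the target."""
    k = len(geo_rows[0])
    weights = [1.0 / k] * k  # start from equal weights
    best = matrix_gap(weighted_distance_matrix(geo_rows, weights), target_matrix)
    for _ in range(iterations):
        improved = False
        for idx in range(k):
            for factor in (1.0 + step, 1.0 - step):
                trial = list(weights)
                trial[idx] *= factor
                total = sum(trial)
                trial = [w / total for w in trial]  # keep weights summing to 1
                gap = matrix_gap(weighted_distance_matrix(geo_rows, trial),
                                 target_matrix)
                if gap < best:  # accept only changes that reduce the distance
                    weights, best, improved = trial, gap, True
        if not improved:
            step /= 2.0  # refine the search once no change helps
    return weights, best
```

A change in one weight is kept only when it makes the two matrices more similar, which mirrors the idea of adjusting the weights in the direction that decreases the Hamming distance.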

This method was implemented as an R script (the scripts are available at https://github.com/geogr-muni/mining-in-archaeology). The sna package [43] was used to work with graphs; it provides the hdist function (Hamming distance), which calculates the absolute distance between labeled graphs.

After reaching the set number of iterations, the script produced the following weights for the geographical variables:

Slope: 0.076
Elevation: 0.079
TPI: 0.22
TWI: 0.03
Soil quality: 0.304
Access to water: 0.289

According to the results above, the variable that best predicts the archaeological components of sites in a particular cadaster is the quality of soil (weight 0.304, i.e. 30.4% of the total weight). This seems to coincide with the outcomes of research relevant to our spatiotemporal extent (for example [44]). The least important variable seems to be the topographic wetness index (weight 0.03, i.e. 3% of the total weight). This means that the TWI had to be significantly suppressed to make the graphs more similar (to make the Hamming distance smaller).

4.3 Prediction model

For further analysis and interpretation, we chose three pairs of archaeological components (periods of Prehistory and the Early Middle Ages that are close to each other in time). We chose subsequent components to track changes in the settlement structure that are already documented in the archaeological literature about the study area. The objective was to verify whether our prediction method reflects known settlement changes and is therefore valid for application to archaeological data. All selected components are represented by a sufficient number of precisely localized sites to form a meaningful model. The chosen pairs are also interesting from the archaeological point of view, as they correspond to the presently known shifts in settlement patterns. The selected component pairs are (see Table 1 for a more precise chronology):

Mesolithic and Middle Neolithic
Final Eneolithic and Final Bronze Age
Early Middle Ages (9th century) and Early Middle Ages (12th century)

To visualise the distribution of values within the components, we created three customised parallel coordinate diagrams (Figure 5). We grouped them by pairs of components to make the differences easier to compare.

Figure 5

Customised parallel coordinate diagram created to visualise the comparison between chosen pairs of archaeological components and their descriptive statistics taken from the training subset. The circle represents the mean value, while colored vertical lines represent the first and last quartiles. Geographical variables and their units are described in section 4.1. Geographical parameters were ordered based on their calculated weights.

The selected averages of geographical parameters were used together with the weights (obtained in section 4.2) to create a prediction model (Figure 6). In this step, we worked only with the six chronological components mentioned above. We then calculated a distance in multidimensional space, which means that for each component we established how far (how different) each cadaster's geographical parameters were from the modelled component mean. The input values ranged between 0 and 1, therefore the calculated output distances were also within the limits of 0 and 1. A distance close to 0 can be interpreted as the cadaster being very similar to those cadasters with the specified archaeological component, so there is a higher possibility that this cadaster contains traces of human activity from that component. Conversely, a greater distance means that the cadaster does not have conditions that would predict increased human activity in the specific time period.

Figure 6

The process of creating prediction models in the third step of analysis.
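The distance calculation behind the prediction model can be sketched as a weighted Euclidean distance from the component mean. The exact distance formula is not given in the text, so this form is an assumption; it does reproduce the stated property that inputs between 0 and 1 yield distances between 0 and 1. The weights are those reported in section 4.2, while the dictionary keys and function name are chosen for illustration.

```python
import math

# Weights reported in section 4.2 (keys are illustrative names).
WEIGHTS = {
    "slope": 0.076, "elevation": 0.079, "tpi": 0.22,
    "twi": 0.03, "soil": 0.304, "water": 0.289,
}

def component_distance(cadaster, component_mean, weights=WEIGHTS):
    """Weighted distance between a cadaster and a component mean.

    Both arguments map variable names to values scaled to the range 0..1.
    Dividing by the weight total keeps the result within 0..1.
    """
    total_weight = sum(weights.values())
    squared = sum(w * (cadaster[name] - component_mean[name]) ** 2
                  for name, w in weights.items())
    return math.sqrt(squared / total_weight)

# A cadaster identical to the component mean has distance 0.
mean_profile = {name: 0.5 for name in WEIGHTS}
print(component_distance(mean_profile, mean_profile))
```

Because soil quality and access to water carry the largest weights, a cadaster that deviates from the component mean in those two variables moves further away than one that deviates by the same amount in TWI.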

In the next step we created a set of six maps. Each one represents the prediction model of one of the selected archaeological components. Furthermore, we added three maps depicting the differences between pairs of components to facilitate the interpretation of the pattern changes from an archaeological standpoint (Figure 8).

4.4 Model evaluation

To validate our weights we used the testing subset of the remaining 113 cadasters. We calculated distances from each testing cadaster to the mean values (calculated from the training subset) of each component, based on the weights taken from the training process. We used the AUC value, which stands for the area under the ROC (receiver operating characteristic) curve and is commonly used for evaluating models [45]. AUC is often used when the model is not meant as a binary predictor but rather yields a value on a probability spectrum, which is exactly the case of our prediction (Figure 8). In such cases ROC is beneficial because it considers varied thresholds for evaluating the model’s predictive ability. The global AUC for the entire testing subset was 0.65 (Figure 7). This number indicates that the model was valid and better than a random guess (which corresponds to an AUC of 0.5, while an ideal model has an AUC of 1).

Figure 7

AUC values for all the components that have at least five occurrences in the training subset and are at least at the second level in the hierarchy. The green line is the global AUC value and the red dotted line symbolises the random guess threshold.

Figure 8

Probabilities of occurrence for 6 selected archaeological components and visualisation of the change of probabilities between two components in each pair.

One of the reasons why the AUC value is not higher is the complexity of the data: we work with 56 components but only one set of weights, which is a limitation of the method used. Also, the proportion of positive values is small (just 803 positive cases out of a total of 6328, i.e. 113 cadasters × 56 components in the testing subset). The AUC metric has been criticised by some authors [46]. In our case, it should be noted that this method weights omission and commission errors equally and is vulnerable to missing positive values in the input data (the archaeological dataset is never complete).
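The AUC used in this section can be computed without drawing the ROC curve, as the share of positive-negative pairs in which the positive case receives the higher score. The sketch below is a generic illustration rather than the evaluation code used in the paper; converting distances to scores as 1 minus the distance is an assumption.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical distances for four cadaster-component cases and their labels
# (1 = component actually present in the cadaster).
distances = [0.1, 0.4, 0.35, 0.8]
labels = [1, 0, 1, 0]
scores = [1.0 - d for d in distances]  # a smaller distance means a higher score
print(auc(scores, labels))
```

Each false negative in the labels (a component that exists but has not been found yet) pulls this value down, which is the sensitivity to incomplete data noted above.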

5 Results and Discussion

Based on the present state of research, archaeological and historical data, we were able to discuss and verify the validity of the method used. One of the validation approaches is to show how our model could extend our knowledge about the history of this region.

The first pair of selected historical components is composed of the Mesolithic and Middle Neolithic periods. It is well known that the settlement patterns of Mesolithic hunter-gatherers and of the first prehistoric agricultural communities differ. Based on ethnographic analogies we can suppose that the crucial criterion for hunter-gatherers was the seasonal use of various natural resources, which is mainly fulfilled by biotope border zones [47]. Current paleobotanical and archaeological research shows that people influenced the landscape even in the Mesolithic, especially through the preference for and deliberate cultivation of some plant species, as well as through burning that created forest clearings. These Mesolithic forest clearings, with their mosaic-structured heterogeneous vegetation, appear in the paleobotanical record of the studied areas, including the Bohemian-Moravian Highlands, with the exception of mountainous areas [48]. In contrast, the agricultural population of the Neolithic was naturally bound to land with very good soil quality, in southern Moravia most often on loess river terraces. It is also possible to assume that Neolithic settlement was tied to the fields and was therefore more sedentary, with more intensive local land use.

When we look at the geographic parameters (Figure 5), we can see that Mesolithic sites (in contrast to the Neolithic sites) exhibit higher soil granulation, terrain slope and elevation. Among all selected components, they also show the largest variability in the relation of the populated landscape to water accumulation (TWI). From the viewpoint of the prediction model, it is clear that the probability of predicting the Mesolithic component is highest within the studied area of the Bohemian-Moravian Highlands (Figure 8a). On the other hand, the Middle Neolithic settlements in our model are located in cadasters with better soil quality that are situated in less sloping terrain and mainly in the neighborhood of river valleys (Figure 8b). The difference between these two components (Figure 8a-b) shows how different the probability of detecting Mesolithic and Neolithic activities in individual cadasters is, which corresponds with the present state of research.

The second selected pair contains the Final Eneolithic (represented by the Bell Beaker Culture in this region) and the Final Bronze Age. Between the Final Eneolithic and the Final Bronze Age, society developed rapidly. Among the changes influencing the settlement pattern we can count changes in agriculture and crop husbandry, population growth, and the significant presence of hilltop settlements in strategic locations in the Final Bronze Age. Neolithic farming practices and the Neolithic package of crops were still in use during the Final Eneolithic [49]. At that time, the number of archaeological contexts increased compared to the period 4000-3000 BC. The low number of objects and an increase in secondary forest taxa are interpreted as indicating the importance of extensive land use accompanied by a change in agricultural and pastoralist practices [50]. The slightly modified Neolithic farming practices and package of crops were still in use in the Early Bronze Age. This tradition appears to have first been altered in the Middle Bronze Age, when changes in arable farming and the establishment of new crops probably began, and it developed further during the Final Bronze Age, when more new plants were established [49]. It is very likely that the changes in the arable economy between the two compared periods of the Final Eneolithic and the Final Bronze Age allowed the population to grow. The prevailing form of the Final Eneolithic residential site is a lowland settlement. Open lowland settlements are evidenced in all periods from the Final Eneolithic to the Final Bronze Age, but hilltop settlements are not always present. The most important evidence of hilltop sites comes from the Urnfield Culture of the Final Bronze Age.

Based on the present state of research, sites in the Final Eneolithic already appear at the same locations as those in the Bronze Age [51]. This can also be verified in Figure 5, where the characteristics that in sum describe more than 55 % of the probability (coarse fragments, slope, TPI) are similar for the pair. The difference in coarse fragments shows that better soil quality was preferred for settlement in the Final Bronze Age. The other characteristics, such as access to water and TWI, might be related to the hilltop locations typical for the Final Bronze Age (the Final Urnfield Period). The difference in elevation, which points to a preference for the Highlands in the Final Eneolithic, probably reflects extensive land use that had not yet been abandoned. This interpretation is also supported by the fact that the model does not show any significant differences in the settlement probabilities of those two periods (Figure 8c and d). Figure 8c-d shows a shift in which the Final Bronze Age has a slightly more positive relation to lowland locations; the probability of finding evidence of the Bell Beaker Culture is therefore higher in the Vysočina Region, which corresponds to the extensive Eneolithic land use. This correspondence between the model and contemporary archaeological findings is another argument for the model’s validity.

The last pair selected for the analysis was the 9th and the 12th century of the Early Middle Ages. These periods can be readily described on the basis of historical knowledge, mainly by the waves of medieval colonisation [52]. However, there are no comprehensive studies of the development of agriculture in the Early Middle Ages. Partial findings show that permanent fields with a fallow period prevailed in the Early Middle Ages, with an emphasis on grain production. Economic growth is assumed between the 9th and 12th centuries, which already in the 12th century brings the beginnings of a three-field system. According to the present state of knowledge, not only agricultural settlements but also the Great Moravian agglomerations of the 9th century were self-supplied [53]. Strong dynamics and population growth are also expected between the 9th and 12th centuries. The traditional medieval village in the old settled landscape is formed in the 11th and 12th centuries. At the end of the 12th century, the settlement structure stabilizes and, with the exception of deserted villages, persists through the High Middle Ages into the modern period [51]. In the central Bohemian-Moravian Highlands, on the other hand, the beginning of this process falls only in the 12th century, and could be even later [52]. Generally, the first centres of an urban character appeared in the 12th century (we can speak of towns from the granting of city rights in the 13th century), and these were primarily of a semi-agricultural character [53].

Based on Figure 5, the 9th century is characterised, in contrast to the 12th century, by soils of higher quality and a lower elevation level associated with sites. To explain the population shift in this period, the factor of elevation is crucial, as the high elevation values in the 12th century refer to new colonial settlements built in higher locations. The values for soil bonity and elevation then show an intensive relation to agriculture at the beginning of the colonisation (Figure 8e). The characteristic of water accumulation in the landscape (TWI) has a higher value for the 12th century and describes the connectivity of settlements to the river valleys in the central Bohemian-Moravian Highlands. The access to water characteristic is at almost the same level for both centuries and is therefore not important for distinguishing these two components. For this pair too, the analysis of geographical characteristics and their synthesis (Figure 8e-f) corresponds with the current state of archaeological and historical knowledge, and therefore further validates our model.

6 Conclusions and future works

In this paper we presented a method to predict promising areas for archaeological research using machine learning on positive examples, more precisely graph comparison based on the Hamming distance. The model was applied to the dataset from the project Historical Landscape of Bohemian Moravian Highlands [3], which consists of 56 chronological components and had never been analysed in such scope. Our predictions are based on input geographical data such as elevation, terrain slope and selected indices derived for the area of the input administrative units. These input parameters probably influenced other characteristics, such as the amount of natural resources, conditions for agriculture or the available possibilities for shelter.

To determine the weights for input environmental parameters, we used an algorithm to calculate the distances between undirected labeled graphs. At the beginning, we created a reference distance matrix of training cadasters based on the similarity of their archaeological components. Then, various weights were applied to modify geographical variables of cadasters and the calculated distance matrices were iteratively compared to the reference matrix. Based on this comparison, a set of most fitting weights was established. Using the calculated weights, we compared every cadaster to each component’s modeled values and calculated the theoretical probabilities (the similarity of the cadaster to the modeled values).
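The weight search summarised above can be illustrated with a small Python sketch. It is not the authors' implementation (their reference list points to R packages): the threshold used to turn a distance matrix into a graph, the exhaustive search over a few candidate weights, the function names and the toy values are all assumptions made for illustration only.

```python
from itertools import combinations, product


def distance_matrix(rows, weights=None):
    """Pairwise weighted Euclidean distances between feature rows."""
    n = len(rows)
    if weights is None:
        weights = [1.0] * len(rows[0])
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = sum(w * (a - b) ** 2 for w, a, b in zip(weights, rows[i], rows[j])) ** 0.5
        dist[i][j] = dist[j][i] = d
    return dist


def to_graph(dist, threshold):
    """Undirected graph as the set of pairs (i, j) with distance <= threshold (assumed rule)."""
    n = len(dist)
    return {(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= threshold}


def hamming(edges_a, edges_b):
    """Number of node pairs that form an edge in exactly one of the two graphs."""
    return len(edges_a ^ edges_b)


def fit_weights(geo, components, candidates, geo_threshold, comp_threshold):
    """Return (distance, weights) for the weights whose graph is closest to the reference."""
    reference = to_graph(distance_matrix(components), comp_threshold)
    best = None
    for weights in product(candidates, repeat=len(geo[0])):
        graph = to_graph(distance_matrix(geo, weights), geo_threshold)
        score = hamming(graph, reference)
        if best is None or score < best[0]:
            best = (score, weights)
    return best


# Invented toy data: four training cadasters, three components, two geographic variables.
components = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
geo = [[0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [0.8, 0.2]]

best_distance, best_weights = fit_weights(geo, components, (0.0, 1.0), 0.3, 0.5)
print(best_distance, best_weights)
```

In this toy case only the first variable separates the cadasters in the same way as their components do, so the search keeps the first variable and drops the second.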

To verify that the models respect archaeological reality, we selected pairs of near-term periods of Prehistory and the Early Middle Ages. We interpreted the modelled differences between those pairs from the archaeological point of view and tried to analyse the geographical preferences of past societies based on the temporal context. On the accompanying maps, we were able to follow developments that corresponded to known archaeological and historical data.

For the statistical validation of our model, we calculated the AUC values both for the entire testing subset and for each component separately. The results show that the global AUC value of the model does not come significantly close to the value of an ideal model. We should note that validation is extremely difficult, as we predict not only the presence of an archaeological location in a cadaster but also its chronological components (we are working with 56 different chronological components), which brings additional complexity and makes interpretation harder. However, when we look at the AUC values for individual components, we observe greater variability, with some components reaching high AUC values.
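The difference between a global AUC and per-component AUC values can be shown with a minimal Python sketch. It uses the rank interpretation of AUC (the share of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half). The function name, the pooling of all components for the global value and the numbers are invented for illustration and are not results from this study.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented toy predictions: (predicted probabilities, observed presence) per component.
predictions = {
    "Mesolithic": ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "Final Eneolithic": ([0.6, 0.4, 0.5, 0.7], [1, 0, 1, 0]),
}

# One AUC value per component.
per_component = {name: auc(s, y) for name, (s, y) in predictions.items()}

# Global AUC over all components pooled together (an assumed pooling scheme).
all_scores = [s for scores, _ in predictions.values() for s in scores]
all_labels = [y for _, labels in predictions.values() for y in labels]
global_auc = auc(all_scores, all_labels)

print(per_component, global_auc)
```

Even in this tiny example the per-component values differ strongly while the pooled value lies in between, which is the kind of variability described above.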

One possible extension would be to calculate separate sets of weights for individual components, in contrast to our current approach, which shows which variables influenced the history of the region overall. This would be limited by the large number of components, which could be aggregated into a smaller number of groups to better fit the model. Extended methods of validation could also be considered, e.g. similar to [21].

The advantage of this method is its robustness, which allows it to be used with incomplete input datasets. The condition is that the input data are, to some degree, regularly distributed throughout the region so that they copy the spatial distribution of the researched phenomenon. Contrary to the majority of previous works, the weights of the input spatial datasets used for the prediction model are estimated algorithmically from the input data by using graph analysis. This method makes no a priori statistical assumptions and is not sensitive to outliers. It also benefits from the larger temporal scope of the input dataset, which allowed us to interpret the location preferences of people during various ages. The algorithm adjusts itself to the specifics of the input environmental parameters; it should therefore be applicable in various geographically diverse regions.

As this paper shows, spatial information is one of the crucial characteristics of archaeological data that define its context. Looking for geographical patterns and their relation to the studied phenomenon is then the role of Geographic Information Sciences (GIS). Applying and testing new methods from within GIS, as well as from machine learning, could therefore help us to understand the regularities of geographical patterns in archaeology, and thus advance the understanding of the history of humankind.

In future work, additional geographical variables could be added to the model, for example:

Intensity of forestation
Natural conditions for transport (friction of the terrain) or trade (centrality of place) but also the real infrastructure (the presence of transport network)
Dominant aspect (it could play a role in specific archaeological components such as agricultural objects)
Latitude

As a future attempt to bring the model nearer to real conditions, we could decouple the input and output data from the administrative units and model them in an interpolated continuous space. In such a model, input environmental variables could be weighted separately and with spatially dependent weights. The influence of the immediate neighborhood on each place could also be taken into consideration, and we could experiment with various neighborhood sizes. On the other hand, the spatial aggregation we used is a suitable and tested way to handle complex datasets and to prevent false precision or overfitting. A reasonable alternative is a rectangular or hexagonal grid with a cell size that prevents excessive complexity but avoids aggregating areas that are too heterogeneous. In our study region, the cadaster borders better reflect the physical conditions of the environment and are therefore better suited for localising centers of human activity than the artificial borders of a grid.

The outcome of the model is a spatial dataset of cadasters with the predicted suitability for each archaeological component. The advantage of this form of output is the possibility of further analyses based on algebraic operations. Figure 8 shows the result of a subtraction operation, which presents the patterns and directions of people moving between subsequent archaeological components. Besides such simple operations, it is also possible to build a complex model that could bring new knowledge of the particular region and of the processes of settlement formation in various times. Such a model can, for example, show which places were more persistent from the view of historical progress (meaning that their settlement probability ranked high for the majority of the archaeological components) and formed the backbone of the region (from the perspective of transport, military, power, market, or others), and which places played only a marginal role.
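Both kinds of operation mentioned above, the difference between two components and the search for persistently suitable places, can be sketched in a few lines of Python. The data layout (a dictionary of cadasters holding one suitability value per component), the cadaster names "A" to "C", the threshold values and all numbers are hypothetical and serve only to show the idea.

```python
def difference_map(suitability, later, earlier):
    """Per-cadaster difference of predicted suitability between two components."""
    return {
        cadaster: round(values[later] - values[earlier], 3)
        for cadaster, values in suitability.items()
    }


def persistent_cadasters(suitability, threshold, min_share):
    """Cadasters whose suitability reaches the threshold in at least min_share of components."""
    selected = []
    for cadaster, values in suitability.items():
        share = sum(1 for v in values.values() if v >= threshold) / len(values)
        if share >= min_share:
            selected.append(cadaster)
    return sorted(selected)


# Invented toy output of a prediction model: suitability per cadaster and component.
suitability = {
    "A": {"Final Bronze Age": 0.9, "9th century": 0.8, "12th century": 0.7},
    "B": {"Final Bronze Age": 0.1, "9th century": 0.2, "12th century": 0.6},
    "C": {"Final Bronze Age": 0.4, "9th century": 0.5, "12th century": 0.5},
}

shift = difference_map(suitability, "12th century", "9th century")
backbone = persistent_cadasters(suitability, threshold=0.6, min_share=0.6)
print(shift, backbone)
```

Positive values in `shift` mark cadasters that become more suitable in the later component, which is the information a difference map displays; `backbone` lists the places that stay suitable across most components.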

An alternative way of visualisation besides difference maps is an animation that could display the predicted movement of settlers through the region in time. There is also the possibility of building a geographical network (where each cadaster is a node and edges are connections between adjacent cadasters) suitable for agent-based simulations or other network analyses. Agents could be represented by simulated communities that move between nodes (creating or abandoning settlements) based on the suitability score of a node derived from our current results. These additional methods could produce new outcomes, as they take into account centrality positions and geographical neighborhood.
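A very small Python sketch of such a network simulation follows. The adjacency list, the suitability scores and the movement rule (a community moves to the most suitable adjacent cadaster only if it is better than the current one) are assumptions chosen for brevity; the text above does not prescribe any particular rule.

```python
def step(position, adjacency, suitability):
    """Move to the most suitable adjacent cadaster if it beats the current one."""
    best = max(adjacency[position], key=lambda node: suitability[node], default=position)
    return best if suitability[best] > suitability[position] else position


def simulate(start, adjacency, suitability, steps):
    """Return the sequence of cadasters visited by one simulated community."""
    path = [start]
    for _ in range(steps):
        path.append(step(path[-1], adjacency, suitability))
    return path


# Hypothetical chain of four adjacent cadasters with invented suitability scores.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
suitability = {"A": 0.2, "B": 0.5, "C": 0.4, "D": 0.9}

print(simulate("A", adjacency, suitability, 3))
print(simulate("C", adjacency, suitability, 2))
```

The community starting in "A" stops in "B" although "D" is more suitable, because the less suitable cadaster "C" lies in between. This illustrates why geographical neighborhood and centrality can change the outcome compared with the suitability scores alone.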

Acknowledgement

This paper is an outcome of projects:

Integrated Research on Environmental Changes in the Landscape Sphere of Earth II (MUNI/A/1419/2016).
Archaeological Survey, Excavation, Documentation and Museum Presentation VII (MUNI/A/0734/2017).

Our scripts and input datasets are published in the GitHub repository at https://github.com/geogr-muni/mining-in-archaeology-research.

References

[1] Balla A., Pavlogeorgatos G., Tsiafakis D., Pavlidis G., Efficient predictive modelling for archaeological research. Mediterr. Archaeol. Ar., 2013, 14(1), 119-129, doi:10.1016/j.culher.2012.10.011

[2] Campana S., Drones in Archaeology. State-of-the-art and Future Perspectives. Archaeol. Prospec., 2017, doi:10.1002/arp.1569

[3] Historické využívání krajiny Českomoravské vrchoviny v pravěku a středověku. 2016 (in Czech) https://naki.phil.muni.cz

[4] Campbell J. S., Archaeological Predictive Model of Southwestern Kansas. Doctoral dissertation, University of Kansas, USA, 2005

[5] Danese M., Masini N., Biscione M., Lasaponara R., Predictive modeling for preventive Archaeology: overview and case study. Open Geosci., 2014, 6(1), 42-55, doi:10.2478/s13533-012-0160-5

[6] Gibbon G., Appendix A: Archaeological Predictive Modeling: An Overview. In: Hudak G. J., Hobbs E., Brooks A., Sersland C. A., Phillips C. (Eds.), A Predictive Model of Precontact Archaeological Site Location for the State of Minnesota. Minnesota Department of Transportation, St. Paul, 2000

[7] Warren R. E., Asch D. L., A Predictive Model of Archaeological Site Location in the Eastern Prairie Peninsula. In: Wescott K. L., Brandon R. J. (Eds.), Practical Applications of GIS for Archaeologists: A Predictive Modeling Kit, Taylor & Fisher, London, 1999, 5-25, doi:10.1201/b16822-3

[8] Willey G. R., Prehistoric settlement patterns in the Viru Valley, Peru. U.S. Govt. Print. Off, Washington, 1953

[9] Chen L., Priebe C. E., Sussman D. L., Comer D. C., Megarry W. P., Tilton J. C., Enhanced Archaeological Predictive Modelling in Space Archaeology. 2013, arXiv preprint arXiv:1301.2738

[10] Kvamme K. L., Computer processing techniques for regional modeling of archaeological site locations. Advances in Computer Archaeology, 1983, 1(1), 26-52

[11] Fernandes R., Geeven G., Soetens S., Klontza-Jaklova V., Deletion/Substitution/Addition (DSA) model selection algorithm applied to the study of archaeological settlement patterning. J. Archaeol. Sci., 2011, 38(9), 2293-2300, doi:10.1016/j.jas.2011.03.035

[12] Stine R. S., Decker T. D., Archaeology, data integration and GIS. In: Allen K.M.S., Green S.W., Zubrow E.B.W. (Eds.), Interpreting Space: GIS and Archaeology. Taylor and Francis, London, 1990, 73-79

[13] Wheatley D., Gillings M., Spatial technology and archaeology: the archaeological applications of GIS. CRC Press, Boca Raton, 2003, doi:10.4324/9780203302392

[14] Conolly J., Lake M., Geographical information systems in archaeology. Cambridge University Press, Cambridge, 2006, doi:10.1017/CBO9780511807459

[15] Kvamme K. L., There and back again: Revisiting archaeological locational modeling. GIS and archaeological site location modeling, 2006, 3-38, doi:10.1201/9780203563359.sec1

[16] Chen L., Comer D. C., Priebe C. E., Sussman D., Tilton J. C., Refinement of a method for identifying probable archaeological sites from remotely sensed data. In: Mapping Archaeological Landscapes from Space 2013, 251-258, Springer, New York, doi:10.1007/978-1-4614-6074-9_21

[17] Demján P., Dreslerová D., Modelling distribution of archaeological settlement evidence based on heterogeneous spatial and temporal data. J. Archaeol. Sci., 2016, 69, 100-109, doi:10.1016/j.jas.2016.04.003

[18] Ebert D., Singer M., GIS, Predictive Modelling, Erosion, Site Monitoring. Assemblage - The Sheffield Graduate Journal of Archaeology, 2004, 8

[19] Lieskovský T., Ďuračiová R., Karell L., Selected mathematical principles of archaeological predictive models creation and validation in the GIS environment. Interdisciplinaria archaeologica. Natural Sciences in Archaeology, 2013, 4(2), 33-46, doi:10.24916/iansa.2013.2.4

[20] Jasiewicz J., Sobkowiak-Tabaka I., Geo-spatial modelling with unbalanced data: modelling the spatial pattern of human activity during the Stone Age. Open Geosci., 2015, 7(1), doi:10.1515/geo-2015-0031

[21] Jasiewicz J., Hildebrandt-Radke I., Using multivariate statistics and fuzzy logic system to analyse settlement preferences in lowland areas of the temperate zone: an example from the Polish Lowlands. J. Archaeol. Sci., 2009, 36, 10, 2096-2107, doi:10.1016/j.jas.2009.06.004

[22] Lieskovský T., Využitie geografických informačných systémov v predikčnom modelovaní v archeológii. MS thesis, Slovak University of Technology in Bratislava, Slovakia, 2011 (in Slovak with English summary)

[23] Banks W. E., d’Errico F., Dibble H. L., Krishtalka L., West D., Olszewski D. I., Peterson A. T., Anderson D. G., Gillam J. C., Montet-White A., Crucifix M., Eco-cultural niche modeling: new tools for reconstructing the geography and ecology of past human populations. PaleoAnthropology, 4, 2006, 68-83

[24] Kondo Y., An ecological niche modelling of Upper Palaeolithic stone tool groups in the Kanto-Koshinetsu region, eastern Japan. The Quaternary Research (Daiyonki-Kenkyu), 54(5), 2015, 207-18, doi:10.4116/jaqua.54.207

[25] van Leusen M., Deeben J., Hallewas D., Zoetbrood P., Kamermans H., Verhagen P., A baseline for predictive modelling in the Netherlands. In: van Leusen M., Kamermans H. (Eds.), Predictive Modelling for Archaeological Heritage Management: A Research Agenda. Nederlandse Archeologische Rapporten, 2005, 29, 25-92

[26] Verhagen P., Case studies in archaeological predictive modelling (Vol. 14). Amsterdam University Press, Amsterdam, 2007, doi:10.5117/9789087280079

[27] Kvamme K. L., A Predictive Site Location Model on the High Plains: An Example with an Independent Test. Plains Anthropol., 1992, 37, 19-40, doi:10.1080/2052546.1992.11909662

[28] Neustupný E. (Ed.), Space in Prehistoric Bohemia. Archeologický ústav AVČR, Praha, 1998

[29] Neustupný E., Archaeological method. Cambridge University Press, New York, 1993

[30] File of Administrative Boundaries and Cadastral Units Boundaries of the CR

[31] Tencer T., Tvorba prediktívneho modelu v oblasti severozápadného Slovenska v kontexte včasného stredoveku. MUNI, Brno, 2011 (in Czech)

[32] Matoušek J., Prediktivní model pravěkého osídlení v oblasti povodí řeky Jevišovky. MUNI, Brno, 2014 (in Czech)

[33] Seif A., Using Topography Position Index for Landform Classification (Case study: Grain Mountain). Bull. Env. Pharmacol. Life Sci, 2014, 3, 33-39

[34] Goláň J., Archeologické prediktivní modelovaní pomocí geografických informačních systémů: Na příkladu území jihovýchodní Moravy. MS Thesis, Masaryk University, Czech Republic, 2003 (in Czech with English summary)

[35] Weiss A., Topographic position and landforms analysis. In: Poster presentation, ESRI user conference, San Diego, CA (Vol. 200), 2001

[36] Gisat SRTM DEM http://www.gisat.cz/content/cz/produkty/digitalni-model-terenu/srtm-dem

[37] SoilGrids.org https://www.soilgrids.org

[38] Beven K. J., Kirkby M. J., A physically based, variable contributing area model of basin hydrology/Un modèle à base physique de zone d’appel variable de l’hydrologie du bassin versant. Hydrological Sci. J., 1979, 24(1), 43-69, doi:10.1080/02626667909491834

[39] Ruszkiczay-Rüdiger Z., Fodor L., Horváth E., Telbisz T., Discrimination of fluvial, eolian and neotectonic features in a low hilly landscape: A DEM-based morphotectonic analysis in the Central Pannonian Basin, Hungary. Geomorphology, 2009, 104(3), 203-217, doi:10.1016/j.geomorph.2008.08.014

[40] Package ‘stats’, version 3.2.1, 2018 https://www.rdocumentation.org/packages/stats/versions/3.2.1/topics/dist

[41] Robinson D. J. S., An Introduction to Abstract Algebra. Walter de Gruyter, 2003, doi:10.1515/9783110198164

[42] Hamming R. W., Error detecting and error correcting codes. The Bell System Technical Journal, 1950, 29, 2, 147–160, doi:10.1002/j.1538-7305.1950.tb00463.x

[43] Package ‘sna’, 2016 https://cran.r-project.org/web/packages/sna/sna.pdf

[44] Dreslerová D., Kočár P., Chuman T., Pravěké osídlení, půdy a zemědělské strategie. Archeologické rozhledy, 2016, 19-46

[45] Hanley J. A., McNeil B. J., The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982, 143.1, 29-36, doi:10.1148/radiology.143.1.7063747

[46] Lobo J. M., Jiménez-Valverde A., Real R., AUC: a misleading measure of the performance of predictive distribution models. Global Ecol. Biogeogr., 2008, 17.2, 145-151, doi:10.1111/j.1466-8238.2007.00358.x

[47] Květina P., Řídký J., Končelová M., Burgert P., Šumberová R., Pavlů I., Brzobohatá H., Trojánková O., Vavrečka P., Unger J., Mervart P., Minulost, kterou nikdo nezapsal. Antropos, Červený Kostelec, 2015, 349-350 (in Czech)

[48] Sádlo J., Pokorný P., Hájek P., Dreslerová D., Cílek V., Krajina a revoluce: významné přelomy ve vývoji kulturní krajiny českých zemí. Malá Skála, Praha, 2008, 47-52 (in Czech)

[49] Hajnalová M., Archeobotanika doby bronzovej na Slovensku: štúdie ku klíme, prírodnému prostrediu, pol’nohospodárstvu a paleoekonómii - The Archaeobotany of Bronze Age Slovakia. The studies to the climate, environment, farming and palaeoeconomy. Univerzita Konštantína Filozofa, Nitra, 2012 (in Slovak with English summary)

[50] Kolář J., Kuneš P., Szabó P., Hajnalová M., Svitavská Svobodová H., Macek M., Tkáč P., Population and forest dynamics during the Central European Eneolithic (4500–2000 BC). Archaeol. Anthropol. Sci., 2016, 1-12, doi:10.1007/s12520-016-0446-5

[51] Šabatová K., Archeologické doklady lidských aktivit v prostoru Tvořihrázského lesa, Studia Archaeologica Brunensia, 2013, 18, 1, 21-38 (in Czech with English summary)

[52] Hrubý P., Hejhal P., Malý K., Kočár P., Petr L., Centrální Českomoravská vrchovina na prahu vrcholného středověku. Archeologie, geochemie a rozbory sedimentárních výplní niv. Opera Universitatis Masarykianae Brunensis Facultas Philosophica. Spisy Masarykovy univerzity v Brně Filozofická fakulta, 2014, 422 (in Czech with English summary)

[53] Dresler P., Břeclav-Pohansko VIII. Hospodářské zázemí centra nebo jen osady v blízkosti centra? - Břeclav-Pohansko VIII. Economic Hinterland of Centre, or Merely Settlements in a Centres Vicinity? Masarykova univerzita, Brno, 2016, 214-233 (in Czech with English summary)

Received: 2017-12-21

Accepted: 2018-03-22

Published Online: 2018-07-20

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
