Introduction
Data
Soft Textual Cartography
Weighted spatial network
Textual distance
R package topicmodels (Grün and Hornik 2011)).
Spatial autocorrelation
The Algorithm
Parameter choice and initial conditions
- three official classifications, m∈{3,9,25}, from the FSO, based on an urban-rural model, see Fig. 5,
- two random memberships (soft and hard) of each municipality i to the group g, m=k, where the number of topics is m∈{3,9,25}, see Fig. 6,
- and three hard memberships, m∈{3,9,25}, obtained from the k-means algorithm on the generalised χ² distance (see subsection “APPENDIX: Generalised chi square distance and term-document distance”) computed from the region-document matrix, represented in Fig. 7.
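The random initial memberships listed above can be illustrated with a minimal sketch. It assumes that a soft membership is a row-stochastic n×m matrix and that a hard membership is its one-hot special case. The function name `random_memberships`, the flat Dirichlet draw and the uniform group assignment are illustrative assumptions, not the sampling scheme of the paper.

```python
import numpy as np


def random_memberships(n, m, seed=0):
    """Return (soft, hard) random membership matrices of shape (n, m).

    Hypothetical helper: the sampling distributions below are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Soft: every row is a random point on the probability simplex
    # (flat Dirichlet, assumed here), so memberships of i sum to 1.
    soft = rng.dirichlet(np.ones(m), size=n)
    # Hard: every municipality belongs to exactly one group,
    # chosen uniformly at random (assumed here).
    labels = rng.integers(0, m, size=n)
    hard = np.zeros((n, m))
    hard[np.arange(n), labels] = 1.0
    return soft, hard


# Toy usage with m = 3 groups and 10 municipalities.
soft, hard = random_memberships(n=10, m=3)
```

Both matrices have rows summing to one, so either can serve as a starting point for an iterative procedure that updates memberships.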
Official classifications
Random memberships
Membership based on word-frequency
R package stats (R Core Team 2017). As shown in Fig. 7, this type of clustering tends, depending on the value of θ, to create patches of municipalities whose Wikipedia pages contain either frequent or rare words. It is not self-evident that these patches should be spatially contiguous.
Results
Membership association
Random initial membership
Official groups as initial membership
Initial membership based on word frequency
Comparison with a classical approach
igraph Python package (Csardi and Nepusz 2006) on this network, which turned out to detect n/2 communities, irrespective of the parameter values. This result could be expected, as S yields a complete network and the degrees of the municipalities are more or less the same.
scikit-learn (Pedregosa et al. 2011) to perform it on the affinity matrix S. Figure 14 shows interesting results: the correspondence between the memberships obtained from spectral clustering and the official classification is already quite good.
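As a rough illustration of spectral clustering on a precomputed affinity matrix, the sketch below uses scikit-learn's `SpectralClustering` with `affinity="precomputed"`. The toy block matrix stands in for S, and the helper name `spectral_memberships` is hypothetical. The settings actually used for Figure 14 are not stated here, so the defaults below are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def spectral_memberships(S, m, seed=0):
    """Hard memberships (labels 0..m-1) from a symmetric, non-negative affinity matrix S."""
    model = SpectralClustering(
        n_clusters=m,
        affinity="precomputed",  # S is used directly as the affinity matrix
        random_state=seed,
    )
    return model.fit_predict(S)


# Toy stand-in for S: two dense blocks joined by weak links,
# so the network is complete, as noted in the text.
n = 20
S = np.full((n, n), 0.05)
S[:10, :10] = 1.0
S[10:, 10:] = 1.0

labels = spectral_memberships(S, m=2)
```

The resulting labels could then be cross-tabulated against an official classification to examine the kind of correspondence the text describes.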