Top

Applied Network Science

Published in:

Open Access 01-12-2018 | Research

An illustrated approach to Soft Textual Cartography

Authors: Raphaël Ceré, Mattia Egloff

Published in: Applied Network Science | Issue 1/2018

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Patentsearch

Off

Abstract

We propose and illustrate an approach of Soft Textual Cartography consisting in the clustering of regions by taking into account both their spatial relationships and their textual description within a corpus. We reduce large geo-referenced textual content into topics and merge them with their spatial configuration to reveal spatial patterns. The strategy consists in constructing a complex weighted network, reflecting the geographical layout, and whose nodes are further characterised by their thematic dissimilarity, extracted form topic modelling. A soft k-means procedure, taking into account both aspects through expectation maximisation on Gaussian mixture models and label propagation, converges towards a soft membership, to be further compared with expert knowledge on regions. Application on the Wikipedia pages of Swiss municipalities demonstrate the potential of the approach, revealing textual autocorrelation and associations with official classifications. The synergy of the spatial and textual aspects appears promising in topic interpretation and geographical information retrieval, and able to incorporate expert knowledge through the choice of the initial membership.

Availability of data and materials

The datasets analysed during the current study are available on the FSO and Wikipedia web-pages: https://www.bfs.admin.ch/bfs/en/home.htmland https://en.wikipedia.org.

Correspondence analysis. 2, 10–13

FSO

Swiss federal statistical office. 2, 5, 6, 8, 14

LDA

Latent dirichlet allocation. 4

MDS

Multidimensional scaling. 9, 15

Introduction

Regional data analysis generally involves numerical or categorical information attached to the regions, such as level intensities or densities provided from census data (e.g. population, socio-economical properties). Another rich information source that should be considered in regional data analysis is “common textual knowledge”. Yet, the question of how to exploit this type of data in quantitative methods is generally not trivial. On one hand, textual data may require human interpretation to be used meaningfully and its use in quantitative methods is not straightforward. On the other hand, when evaluating an algorithm, textual data can be useful to provide insight in the results.

In this paper, we first show how it is possible to use textual data in regional geography, and more precisely how to extract textual distances and use them in an adapted clustering algorithm. Secondly, we address the question: how to interpret the clusters obtained from the algorithm in view of, textual and regional characteristics, and using expert knowledge? From a geographical perspective, this second idea follows (Grady and Funka-Lea 2004), which argues that fully automated spatial data analysis does not exploit the advantage of the practitioner’s input performing the classification. Indeed, a person has to evaluate the results of any automated procedure without knowing exactly how the latter was really performed. Even more, the similarity between administrative entities depends on the points of view. For example a territorial network admits several “valid” classifications corresponding on the nature of the analysis, interest or study objectives. Thus, the knowledge provided from the practitioner can be included by specifying in a clustering task, initial memberships to infer the segmentation in a certain aim with keeping the advantage of automated approach. Also, the memberships yielded by the method can be analysed by the practitioner to identify interesting spatio-textual patterns as well as used to refine the initial membership of the algorithm; leading to an iterative clustering approach.

Methodologically this paper uses “Soft Textual Cartography”, as previously developed in (Egloff and Ceré 2018). Textual information is used with the method of regional semi-automated soft clustering proposed by Ceré and Bavaud (Ceré and Bavaud 2017; 2018). That implements the combination of spatial configuration and features distances in an image segmentation framework (see (Youssef Mourchid and Cherifi 2017) for a conceptually comparable approach) to perform semi-automated regional segmentation.

We improve the results as presented in Egloff and Ceré (Egloff and Ceré 2018). Applying the method on a larger dataset composed by all the municipalities of Switzerland. It furthermore, emphasises the role of the initial memberships in the iterative procedure. Also, the analysis of the results is clarified by means of correspondence analysis (CA) between different memberships. For the validation of the obtained memberships we use an official classification provided by the Swiss Federal Statistical Office (FSO).

The paper is structured as follows: section “Data” introduces the basic ingredients necessary for the “soft textual cartography” and the data used for the illustration of the method. Then, in section “Soft Textual Cartography” we introduce the heart of the method explaining: the extraction of the weighted spatial network, the textual distance obtained from topic modelling on the corpus, the spatial autocorrelation and finally, the clustering algorithm itself. In section “Parameter choice and initial conditions”, different initial memberships used to test the model are described, among which the official classification. Section “Results” presents a method to evaluate membership association and analyses some results obtained by the algorithm and compares it to a classical approach. Finally, section “Conclusions” draws some conclusions about the usage of the algorithm.

Data

Soft textual cartography requires a minimal amount of elements (Egloff and Ceré 2018), namely a dataset of n regions with relative weights f_i>0, ${\sum \nolimits }_{i}^{n}{f_{i}} = 1$, reflecting their surface, population, or description size. Also each region has to be associated with a text, such as a descriptive document, involving a total variety of N words. The final element consists in the spatial configuration, which is defined by the binary adjacency matrix A=(a_ij) of size n×n with values 1 if i and j are distinct and neighbours, and 0 otherwise.

Textual data consists of the Wikipedia pages (Wikipedia 2018; DBpedia 2017) of the n=2068 municipalities of Switzerland. To keep a spatial continuum, municipalities of Liechtenstein, as well as foreign enclaves (Campione d’Italia and Büsingen am Hochrein) present in the Swiss territory have been included.

Textual sections about important regional personalities as well as external links have been removed. Also, all references to cantons and municipality names have been withdrawn along with the usual stop-words. Finally, low- and high-frequency terms (respectively less than 20 and more than 9000 occurrences) have been also removed (Lu et al. 2017; Xu et al. 2017). Figure 1 shows the resulting weight-frequency f. This f is reflects the textual volume of information of the Wikipedia pages and defines the relative weight of the municipalities as used in the algorithm.

Soft Textual Cartography

This section introduces the ingredients involved in the algorithm, in particular the neighbourhood network of the municipalities and the textual distance. Then, with the help of Moran’s I, we measure the textual autocorrelation relative to the spatial configuration. Finally, we introduce a particular version of the algorithm leveraging on our previous work (Egloff and Ceré 2018).

Weighted spatial network

The spatial connectivity between the n regions is expressed by a (n×n) symmetric non-negative exchange matrix$ E(A, f, t)=(e_{ij}^{\scriptscriptstyle (t)})$. The latter specifies the joint probability to select the unoriented edge ij as prescribed from the time-continuous Markov diffusive process with jump generator A at time t>0; the so called Laplacian diffusion kernel of machine learning (Smola and Kondor 2003; Fouss et al. 2016) constitutes an unoriented unweighed network. Note that the transition matrix $w_{ij}(t)=e_{ij}^{\scriptscriptstyle (t)}/f_{i}$ is reversible and has a stationary distribution f. The weight-compatible $e_{i\bullet }={\sum \nolimits }_{j=1}^{n} e_{ij}=f_{i}$ (Bavaud 2013) diffusive exchange matrix constitutes a weighted generalisation of the unweighed approach using diffusive kernel. Its limit ${\lim }_{\ t\to 0} e_{\scriptscriptstyle ij}^{\scriptscriptstyle (t)}=f_{i}\delta _{ij}$ depicts a network made of disconnected nodes, while ${\lim }_{\ t\to \infty } e_{\scriptscriptstyle ij}^{\scriptscriptstyle (t)}=f_{i}f_{j}$ represents a complete weighted network.

Textual distance

There are several possible ways to extract distances between the municipalities from textual data. For the approach illustrated a topic distance is defined as follows. First, we define the N×n term-municipality matrix as the matrix associating each term with its frequency in the document corresponding to each municipality. In a second step we use the Latent Dirichlet Allocation (LDA) (Blei et al. 2003) algorithm to extract the latent k topics from the texts, from which the χ² distances are finally extracted (see below).

The main idea behind LDA is that a document is conceived as a random mixture over k latent topics and each topic a random mixture over the terms or words. The topics obtained from LDA generally are able to regroup words used in similar contexts (semantically correlated or synonyms) into the same topic or theme, namely a set of terms. Consequently, a word possessing more than one sense can belong with a high probability to more than one topic (for example: see “businesses” in topics V2 and V5 in Fig. 3). Furthermore, the theme is mappable to its spatial configuration. The resulting maps can be used for a visual interpretation of geographical socio-economical phenomena. For instance, the topic V4 in Fig. 3 highlights clearly historically established urban regions such as Zurich and Geneva, and is mainly associated to the terms: “city”, “town” and “century”. In this paper we use the Gibbs sampling method to approximate the solution of the LDA to as implemented in the R package topicmodels (Grün and Hornik 2011)).

As the municipalities are in a one-to-one correspondence with the documents: the probability distributions of the municipalities over the topics is defined as the row-normalised (n×k) document-topic matrix R=(r_iq), and the probability distributions of the terms over the topics is defined as the row-normalised (N×k) term-topic matrix C=(c_lq). The latter permits an interpretation of the topics, whereas the R matrix is used to extract topic distances between the regions.

To extract the (n×n) topic-distance D=(d_ij) from the previously defined municipality-topic matrix R the χ² distance $d_{ij}^{\chi }={\sum \nolimits }_{q=1}^{k}(r_{iq}-r_{jq})^{2}/R_{k}$ (where $R_{q}={\sum \nolimits }_{i=1}^{n}f_{i}r_{iq}$ are the topic weights) is computed between the topic distributions of the municipalities, i.e. the rows of the R matrix. Figures 2 & 3 depict the topic probabilities of the Swiss municipalities; noticeably the topics extracted seem to be spatially autocorrelated.

Spatial autocorrelation

Obviously, the basic spatial statistical analysis or classification of an spatial data set makes sense only if a spatial autocorrelation is present. The Moran’I provides an index of spatial autocorrelation (Anselin 2010) measuring to which extent the topic-distance D is smaller between spatially close municipalities, as defined by the spatial configuration E. We use here the weighted, multivariate generalisation of Moran’s I where the spatial autocorrelation significance is evaluated with the standardised test value z (e.g. (Bavaud 2013; Ceré and Bavaud 2017))

$$ I\equiv I(E,D)= \frac{\Delta \ - \ \Delta_{\text{{loc}}}}{\Delta}\qquad \text{with} \qquad z= \frac{|I - E_{0}(I)|} {\sqrt{\text{Var}_{0}(I)}} $$

(1)

$$ \text{where} \quad\quad \Delta=\frac{1}{2}\sum\limits_{i,j=1}^{n} f_{i} f_{j} D_{ij} \quad\quad \text{and} \quad\quad \Delta_{\text{{loc}}}=\frac{1}{2}\sum\limits_{i,j=1}^{n} e_{ij} D_{ij} $$

(2)

respectively define the total inertia between all regions and the local inertia between connected regions. The Fig. 4 shows the measured I, ranges in [−1,1], where a large positive value is expected when the topic distributions between neighbours are close.

The Algorithm

As a reminder, the soft clustering method already described in (Egloff and Ceré 2018) is reproduced in this section, with minor adaptations. This approach combines textual information and spatial configuration independently. Notice that the initial membership or partition Z⁰ can be used other information (e.g. expert knowledge).

The assignment of n objects to m groups is represented by the non-negative and row-normalised (n×m) membership matrix Z=(z_ig), where z_ig denotes the probability p(g|i) that region i belongs to group g. In the general soft case, z_ig≥0 with ${\sum \nolimits }^{m}_{g=1} z_{ig} = z_{i\bullet } = 1$, whereas z_ig=0 or z_ig=1 in the hard case.

The soft regional clustering for communities detection (Ceré and Bavaud 2017; 2018) is initialised with initial membership $Z^{0} = \left (z^{0}_{ig}\right)$ and is using expectation maximisation to produce the final assignment. Explicitly, a good membership is defined as local minima of the generalised discontinuity free energy functional${\mathcal {F}}[Z]$ from Z⁰:

$$ {\mathcal{F}}[Z]={\mathcal{K}}[Z]+\beta\Delta_{W}[Z]+\frac{\alpha}{2}{\mathcal{G}}[Z] $$

(3)

where the regularising entropy term ${\mathcal {K}}[Z]$, favouring the advent of soft clustering, is the mutual information between the n regions and the m groups

$$ {\mathcal{K}}[Z] = \sum\limits_{ig} f_{i} z_{ig} \ln \frac{z_{ig}}{\rho_{g}} \qquad\qquad \rho_{g}=\sum\limits_{i=1}^{n} f_{i}z_{ig} $$

(4)

where ρ_g is the group weight. The second term $\Delta _{W}[Z]={\sum \nolimits }_{g=1}^{m}\rho _{g}\Delta _{g}$ is the within-group inertia relatively to the topic distances, whose presence supports the constitution of group of regions homogeneous enough relatively to the topic distributions, where (Bavaud 2009)

$$ \Delta_{g} =\frac12\sum\limits_{ij}f_{i}^{g}f_{j}^{g} D_{ij} \qquad\qquad f_{i}^{g}=p(i|g)=\frac{f_{i}z_{ig}}{\rho_{g}} $$

(5)

The third discontinuity term ${\mathcal {G}}[Z]={\sum \nolimits }_{g=1}^{m}\rho _{g}^{-1} \varepsilon [z^{g}]$ and $\varepsilon [z^{g}]=\frac 12{\sum \nolimits }_{ij}e_{ij}(z_{ig}-z_{jg})^{2}$, insures the spatial continuity of the group memberships. As for ${\mathcal {K}}[Z]$, the “spatial energy” ${\mathcal {G}}[Z]$ favours the constitution of soft clusters, in contrast to the “feature energy” Δ_W[Z] which favours hard memberships obeying z_ig=0 or z_ig=1 (Bavaud 2009).

The parameter β>0 controls the influence of topic distances, while α=0 coincides with the soft k-means algorithm based on spherical Gaussian mixtures.

Minimising the free energy functional (3) is performed by cancelling the first-order derivative under the conditions z_i∙=1 and yields:

$$ z_{ig}=\frac{\rho_{g} \exp\left(- \beta D_{i}^{g}+\alpha\rho_{g}^{-1} ({\mathcal{L}}z^{g})_{i}-\frac{\alpha}{2}\rho_{g}^{-2} \varepsilon[z^{g}]\enspace\right)}{{\sum\nolimits}_{h} \rho_{h} \exp\left(- \beta D_{i}^{h}+\alpha\rho_{h}^{-1} ({\mathcal{L}}z^{h})_{i}-\frac{\alpha}{2}\rho_{h}^{-2} \varepsilon[z^{h}]\right)} $$

(6)

where $D_{i}^{g}$ the standardised¹ squared Euclidean dissimilarity from i to the centroid of group g and $({\mathcal {L}}z^{g})_{i}$ is the Laplacian of membership z^g at region i, comparing its value to the average value of its neighbours as defined by the matrix W - an ingredient typical of label propagation models.

Equation (6) is solved iteratively until convergence. The choice of the initial membership matrix Z⁰ is discussed in section “Parameter choice and initial conditions”. The hardness of the final membership matrix Z^∞ can possibly be measured by the value of the mutual information ${\mathcal {K}}[Z^{\infty }]$. Also, the point-wise conditional entropy $H(G|i)=-{\sum \nolimits }_{g} z^{\infty }_{ig}\ln z^{\infty }_{ig}$ (where G denotes the variable “group”) measures the membership uncertainty of region i, and takes on large values for regions located at the group frontiers. Alternatively, the final membership matrix can be further hardened by assigning each region i to group $G[i]=\arg \max _{g \in \{1, \dots, m\}} z_{ig}^{\infty }$.

Parameter choice and initial conditions

To illustrate the algorithm and study the influence of the initial membership Z₀, the following parameter choices were made. First, parameter k (the number of topics) was chosen to be the same as the number of groups m, thus k=m. In turn, m was chosen to correspond to the numbers of groups presented in the three official municipality classifications issued by the FSO, namely m=3, m=9 and m=25. The value for parameters β and α of the soft clustering algorithm have been tuned by numerical experimentation. The free parameter β, which can be interpreted as the inverse temperature in statistical mechanics, controls the hardness of the classification. The free parameter α controls the extent to which the spatial configuration is taken into account. Finally, the parameter t controls the age of diffusive process: a low t limits the interactions to the nearest neighbours.

To use the clustering algorithm proposed in section “The Algorithm” an initial membership matrix Z⁰ is required. To study the impact of the initial membership we went beyond the method proposed in (Egloff and Ceré 2018), where pre-selected municipalities were used based on their atypicality in the correspondence analysis over the topics (their distance towards the mean profile). Hence, three different initial membership attributions:

three official classifications, m∈{3,9,25}, from the FSO based on a urban-rural model, see Fig. 5,

Fig. 5
Illustration of the maps of the official classifications, from a to c the parameters are: m=3; m=9; m=25

×
two random memberships (soft and hard) for each municipality i to the group g, m=k, where the number of the topics is m∈{3,9,25}, see Fig. 6,

Fig. 6
The maps of the groups obtained by hard k-means clustering on the MDS over the generalised χ² distance between municipality profiles in the term-municipality matrix. From a to c the parameters are: m=3, θ=1.01; m=9, θ=0.5; m=25, θ=1.5

×
and three hard memberships, m∈{3,9,25}, obtained from the k-means algorithm on the generalised χ² distance see subsection “APPENDIX: Generalised chi square distance and term-document distance” obtaining from the region-document matrix, represented in Fig. 7.

Fig. 7
The maps of the groups obtained by hard k-means clustering on the MDS over the generalised χ² distance between municipality profiles in the term-municipality matrix. From a to c the parameters are: m=3, θ=1.01; m=9, θ=0.5; m=25, θ=1.5

×

Official classifications

The official municipalities classifications, m=3,9,25, of Switzerland (Zecha et al.) (version 2017) is based on the delimitation of the urban space in 2012 based upon morphological (density) and functional (commuting flows) conditions. The m=9 categories include the size and the accessibility of the municipalities. The so called rural-urban typology m=3 depicts the “Urban (1)”, “Intermediary (2)”, “Rural (3)” municipalities which is based on the classification m=9. The m=25 categories distinguishes by socio-economic conditions in municipalities. The details of how those typologies have been determined are not further investigated here; those typologies are used here as the “gold standard” to compare the results further obtained.

Random memberships

For further testing, we first create random memberships where each region is uniformly assigned to groups g=1,…,m. Three of them are illustrated in Fig. 6.

Membership based on word-frequency

To test the algorithm further we compute another initial membership based on the term frequencies: we first define a distance based on the term-municipality matrix (defined in 2). To do this, we used the generalised χ² distance (see “APPENDIX: Generalised chi square distance and term-document distance”) to compute the distance between the municipalities with respect to their word frequency profile. Figure 7 depicts three examples of groups obtained by submitting the distance obtained to an MDS to which we applied a hard k-means clustering (Hartigan and Wong 1979) with the R package stats (R Core Team 2017). As shown in Fig. 7 this type of clustering has a tendency, depending on the value of θ, to create patches of municipalities that either have frequent or rare words in their Wikipedia page. It is not self evident that these patches should be spatially contiguous.

Results

In this section, we introduce membership association between two memberships, which is later used to compare the results of the algorithm with the official classifications. Then, for each initial membership discussed in section “Parameter choice and initial conditions” we briefly analyse some results. Finally, we compare the present soft textual cartography approach to two classical approaches based on a network obtained from an affinity matrix.

Membership association

Starting with the initial membership Z⁰, the iterative algorithm (6) converges towards a local minimum Z^∞ of the free energy. Z^∞ constitutes a soft membership, which can be further hardened for interpretation purposes, by entirely assigning each municipality i to group $G[i]=\arg \max _{g \in \{1, \dots, m\}} (z^{\infty }_{ig})$. On one hand, the iterative process, depending only on the weighted geographical network as well as the the topic-induced distances, should erase in large part the initial attribution Z⁰ of municipalities to groups. On the other hand, procedures such as the k-means, soft k-means and their variants are well-known to exhibit sensitive dependence on initial conditions, that is the local minimum Z^∞ does in general depend on the initial membership Z⁰.

To compare two classifications, Z with m groups (such as the result of the clustering, hardened or not) and Y with $\tilde {m}$ group (such as the official classification), one can first define the $m\times \tilde {m}$overlap matrix${\mathcal {T}} = (\tau _{gh})$

$$ \tau_{gh} = \sum\limits^{n}_{i=1}f_{i} z_{ig} y_{ih} $$

(7)

whose margins give by construction the group weights ρ_g=τ_g∙ and $\pi _{h}={\sum \nolimits }_{i}f_{i} y_{ih}=\tau _{\bullet h}$. The matrix ${\mathcal {T}}$ constitutes a normalised version of the contingency table $N {\mathcal {T}}$ (where N is the total number of terms in the corpus), whose chi-square attests, expectedly and in all the instances encountered in this work, a very significative dependence between both classifications. Their association can be further investigated by performing a CA on ${\mathcal {T}}$, the resulting bi-plots (Figs. 8, 9, 10, 11, 12, 13, 14) permitting to identify which groups g=1,…,m of Z possibly correspond to which groups $h=1,\ldots,\tilde {m}$ of Y, and to which extent.

Random initial membership

Starting with random memberships as illustrated in subsection “Random memberships” in 2 permits to test how the algorithm behaves when there is not any preliminary information available on the groups.

As depicted in Figs. 8 & 9 the algorithm produced groups which match surprisingly well the official classifications. This result could imply that the different types of municipalities (in the case of m=3, the official groups being: “Urban (1)”, “Intermediary (2)”, “Rural (3)”) are reflected by the topics present in the text of their Wikipedia page. For m=9 and m=25, the match between the official classification and the detected ones is thinner: it could be the case that some types of official groups are less reflected in the topics that the three broad categories of m=3, for example “urban of a big agglomeration (11)” and “urban of a mean agglomeration (12)”: those categories make sense from a classification perspective, as they correlate to population and density, but are harder to extract from the Wikipedia description.

Official groups as initial membership

To test if the algorithm minimises correctly given ideal initial memberships representing the practitioner’s knowledge or an official classification, and to verify the intuition that some official categories are more difficult retrieve from the textual description of the municipalities, the initial membership was set to correspond to the official one. Figure 10, for m=3 this initial membership yields, as expected, a better result than the random initial membership. Figure 11, for m=9 and m=25, the choice of initial memberships is less crucial, and the intuition that some groups proposed by the FSO are harder to recover in the corpus of Wikipedia pages is thus confirmed.

Initial membership based on word frequency

We explored another approach using memberships obtained by using the hard k-means algorithm on the generalised χ² distance (see section “APPENDIX: Generalised chi square distance and term-document distance”) of the municipalities in the term-document matrix. This choice of the initial memberships constitutes an intermediate case between randomness and complete information, and inherits its initial memberships from a distance where the terms can be over-weighted using parameter θ. Initial memberships reflect common usage of rare or frequent words (respectively using θ<1 or θ>1) which can be interpreted as a partial knowledge on the textual similarity between municipalities. The results are consistent with the two cases previously observed (see sections “Random memberships” and “Official groups as initial membership”).

Comparison with a classical approach

How to combine the spatial configuration E of the regions with their textual distances D in order to build a complex network on which clustering or boundary detection are then applied is not a trivial question.

An alternative, more classical approach is to combine the textual dissimilarity D_ij with the spatial proximity $e_{\scriptscriptstyle ij}^{\scriptscriptstyle (t)}$ used in graph image segmentation (Lézoray and Grady 2012; Solem 2012) which yields the pairwise region affinity S=(s_ij) as in:

$$ s_{ij}=\frac{e_{ij}^{(t)}}{f_{i}f_{j}}\: \exp(-\lambda\: D_{ij}) $$

(8)

where the spatial component $e_{ij}^{(t)}/f_{i}f_{j}$ compares the spatial interaction of order t between regions i and j to its expected value under independence. The free parameter λ>0 controls the pairwise similarity. The higher s_ij, stronger is the interaction along the edge ij.

For a general comparison we used the well known community detection algorithm Infomap (Rosvall et al. 2009), from the igraph python package (Csardi and Nepusz 2006) on this network, which turned out to detect n/2 communities, irrespectively of the values of parameters. This result could be expected as S yields a complete network and the degrees of municipalities are more or less the same.

Another classical community detection algorithm is spectral clustering (von Luxburg 2007). We used the python package scikit-learn (Pedregosa et al. 2011) to perform it on the affinity matrix S. Figure 14 shows interesting results, where the correspondence between memberships obtained from spectral clustering and the official classification are already quite good.

Conclusions

This paper exposes and explores the application of the soft clustering algorithm to the exploration of a spatial and thematic corpus based on the Wikipedia pages of Swiss municipalities. We focused the analysis on the impact of differing initial memberships on the results, in order to explore the robustness of the algorithm; the matching of the latter to the official classifications, permitting to incorporate the practitioner’s knowledge in the analysis, namely the socio-economical and geographical categorisation of municipalities.

This study has permitted, on one hand, to show that the algorithm strongly depends on the textual or topic distances in use, but is otherwise less sensitive to the initial memberships. On the other hand, the association of the groups computed by the algorithm with the official classification of the municipalities is surprisingly high. Finally, the results demonstrate that the Wikipedia pages of the municipalities constitute a corpus that is both spatially and thematically correlated.

This flexible semi-automated approach shows its potential at the exploration stage for large spatio-textual dataset: on one hand the initial membership provides a means to direct the algorithm based on available knowledge, on the other hand this knowledge can be created by interpreting the results. For this task, the interpretation of the topics with respect to their spatial configuration (e.g. geographical) and their defining words is of great value. Consequently, the algorithm can be used as a part of a semi-automatic iterative clustering retaining both aspects of the regions, namely their textual content and their spatial configuration.

APPENDIX: Generalised chi square distance and term-document distance

The generalised χ² distance defined in (11) provides a parameter θ which enables to control if the distance should be more sensible to high or low frequencies in the distributions. To define this distance let U=(u_il) be the (n×N) document-term matrix, counting the number of occurrences of term l in document i. The relative document-weights f, term-weights v and quotients η are

$$ f_{i}=\frac{u_{i\bullet}}{u_{\bullet\bullet}} \qquad\qquad v_{l}=\frac{u_{\bullet l}}{u_{\bullet\bullet}} \qquad\qquad \eta_{il}=\frac{u_{il}\: u_{\bullet\bullet}}{u_{i\bullet}\: u_{\bullet l}} $$

(9)

The χ² distance between documents i and j is

$$ d_{ij}=\sum\limits_{l}v_{l} (\eta_{il}-\eta_{jl})^{2} \text{.} $$

(10)

And the generalised χ² distance is defined as:

$$ d_{ij}=\sum\limits_{l}v_{l} (\varphi(\eta_{il})-\varphi(\eta_{jl}))^{2} \text{ where} \varphi(\eta) \text{ is any increasing function. } $$

(11)

By construction d_ij defines a squared Euclidean distance between documents i and j, thus Multidimensional Scaling (MDS) (Bavaud 2004) can be performed.

For instance consider φ(η)=η^θ with θ≥0. The case θ=1 yields the usual χ² distance. θ>1 overweights the contribution of frequent terms, and θ<1 overweights the contribution of rare terms. The case θ=1/2 yields the so-called Hellinger distance (Deza and Deza 2009), and θ→0 yields the presence-absence dissimilarity:

$$ {\lim}_{\theta\to 0+}d^{(\theta)}_{ij}=V_{ij^{c}}+V_{i^{c}j} $$

(12)

where $V_{ij^{c}}={\sum \nolimits }_{l; l\in i, l\notin j}v_{l}\phantom {\dot {i}\!}$ is the total weight of terms present in i but not in j, and $V_{i^{c}j}\phantom {\dot {i}\!}$ is defined analogously.

Acknowledgments

We would like to thank Professor François Bavaud - our thesis supervisor - for his crucial expert advice in clustering and textual methods, his review as well as steadfast encouragement and confidence.

Funding

University of Lausanne, Switzerland.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Availability of data and materials

The datasets analysed during the current study are available on the FSO and Wikipedia web-pages: https://www.bfs.admin.ch/bfs/en/home.htmland https://en.wikipedia.org.

previous article Community evolution in patent networks: technological change and network dynamics

next article Comparison of physician networks constructed from thresholded ties versus shared clinical episodes

D is divided by $\Delta =\frac {1}{2}\sum _{i, j=1}^{n} f_{i}f_{j}D_{ij}$ which amounts to recalibrate the value of the free parameter β.

Anselin, L (2010) Local Indicators of Spatial Association-LISA. Geogr Anal 27(2):93–115.CrossRef

Bavaud, F (2009) Aggregation invariance in general clustering approaches In: Advances in Data Analysis and Classification 3.3, 205–225.. Springer, Berlin.

Bavaud, F (2013) Testing spatial autocorrelation in weighted networks: the modes permutation test In: J Geogr Syst, vol 3.15, 233–247.. Springer, Berlin.

Bavaud, F (2004) Generalized factor analyses for contingency tables In: Classification, Clustering, and Data Mining Applications, 597–606.. Springer, Berlin.CrossRef

Blei, DM, Ng AY, Jordan MI (2003) Latent Dirichlet Allocation In: Journal of Machine Learning Research 3, 993–1022. ISSN: 1532-4435.

Ceré, R, Bavaud F (2017) Multi-labelled Image Segmentation in Irregular, Weighted Networks: A Spatial Autocorrelation Approach In: GISTAM 2017 - Proceedings of the 3rd International Conference on Geographical Information Systems Theory, Applications and Management, vol 1, 62–69.. SciTePress, Setúbal.

Ceré, R, Bavaud F (2018) Soft image segmentation: on the clustering of irregular, weighted, multivariate marked networks. In: Springer Berlin Heidelberg (ed)Geographical Information Systems Theory, Applications and Management. Third International Conference, GISTAM 2017, Porto, Portugal, April 27-28, 2017, Revised Selected Papers. Communications in Computer and Information Science CCIS.

Csardi, G, Nepusz T (2006) The igraph software package for complex network research. InterJournal Complex Systems:1695. https://igraph.org. Accessed 27 Mar 2018.

DBpedia (2017) DBpedia. https://dbpedia.org. Accessed 27 Mar 2018.

Deza, MM, Deza E (2009) Encyclopedia of Distances. Springer-Verlag, Berlin Heidelberg.CrossRefMATH

Egloff, M, Ceré R (2018) Soft Textual Cartography Based on Topic Modeling and Clustering of Irregular, Multivariate Marked Networks. In: Cherifi C, Cherifi H, Karsai M, Musolesi M (eds)Complex Networks & Their Applications VI, 731–743.. Springer.

Fellows, I (2014) Wordcloud: Word Clouds, (R package version 2.5). https://CRAN.R-project.org/package=wordcloud. Accessed 27 Mar 2018.

Fouss, F, Saerens M, Shimbo M (2016) Algorithms and models for network data and link analysis. Cambridge University Press, Cambridge.CrossRef

Grady, L, Funka-Lea G (2004) Multi-label Image Segmentation for Medical Applications Based on Graph-Theoretic Electrical Potentials. In: Sonka M, Kakadiaris IA, Kybic J (eds)Computer Vision and Mathematical Methods in Medical and Biomedical Image Analysis, 230–245.. Springer.

Grün, B, Hornik K (2011) topicmodels: An R Package for Fitting Topic Models. J Stat Softw 40.13:1–30.

Hartigan, JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc: Ser C: Appl Stat 28(1):100–108.MATH

(2012) Image processing and analysis with graphs: theory and practice. Digital imaging and computer vision series(Lézoray O, Grady L, eds.). Taylor & Francis, Boca Raton.

Lu, K, Cai X, Ajiferuke I, Wolfram D (2017) Vocabulary size and its effect on topic representation. Inf Process Manag 53.3:653–665.CrossRef

von Luxburg, U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416.MathSciNetCrossRef

Pedregosa, F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res 12:2825–2830.MathSciNetMATH

R Core Team (2017) R: A Language and Environment for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed 27 Mar 2018. R Foundation for Statistical Computing.

Rosvall, M, Axelsson D, Bergstrom CT (2009) The map equation. Eur Phys J Spec Top 178.1:13–23.CrossRef

Smola, AJ, Kondor R (2003) Kernels and regularization on graphs. vol 2777 In: COLT, 144–158.. Springer, Berlin.

Solem, JE (2012) Programming Computer Vision with Python - Tools and algorithms for analyzing images. O’Reilly Media, Inc., Sebastopol.

Xu, Y, Yin Y, Yin J (2017) Tackling topic general words in topic modeling. Eng Appl Artif Intell 62:124–133.CrossRef

Wikipedia (2018) Wikipedia, The Free Encyclopedia. https://en.wikipedia.org. Accessed 27 Mar 2018.

Youssef Mourchid, MEH, Cherifi H (2017) An Image Segmentation Algorithm based on Community Detection In: Complex Networks & Their Applications V Proceedings of the 5th International Workshop on Complex Networks and their Applications (COMPLEX NETWORKS 2016), 821–830.. Springer, Berlin.

Zecha, L, Kohler F, Goebel VNiveaux géographiques de la Suisse. Typologie des communes et typologie urbain-rural 2012. Tech. rep.

Title: An illustrated approach to Soft Textual Cartography
Authors: Raphaël Ceré
Mattia Egloff
Publication date: 01-12-2018
Publisher: Springer International Publishing
Published in: Applied Network Science / Issue 1/2018
Electronic ISSN: 2364-8228
DOI: https://doi.org/10.1007/s41109-018-0087-y

Springer Professional

An illustrated approach to Soft Textual Cartography

Abstract

Availability of data and materials

Introduction

Data

Soft Textual Cartography

Weighted spatial network

Textual distance

Spatial autocorrelation

The Algorithm

Parameter choice and initial conditions

Official classifications

Random memberships

Membership based on word-frequency

Results

Membership association

Random initial membership

Official groups as initial membership

Initial membership based on word frequency

Comparison with a classical approach

Conclusions

APPENDIX: Generalised chi square distance and term-document distance

Acknowledgments

Funding

Competing interests

Publisher’s Note

Availability of data and materials

Premium Partner

Springer Professional

Abstract

Availability of data and materials

Introduction

Data

Soft Textual Cartography

Weighted spatial network

Textual distance

Spatial autocorrelation

The Algorithm

Parameter choice and initial conditions

Official classifications

Random memberships

Membership based on word-frequency

Results

Membership association

Random initial membership

Official groups as initial membership

Initial membership based on word frequency

Comparison with a classical approach

Conclusions

APPENDIX: Generalised chi square distance and term-document distance

Acknowledgments

Funding

Competing interests

Publisher’s Note

Availability of data and materials

Other articles of this Issue 1/2018

Identifying network structure similarity using spectral graph theory

Co-occurrence simplicial complexes in mathematics: identifying the holes of knowledge

Replicator equation on networks with degree regular communities

Comparison of physician networks constructed from thresholded ties versus shared clinical episodes

A spatially heterogeneous network-based metapopulation software model applied to the simulation of a pulmonary tuberculosis infection

A multiplex financial network approach to policy evaluation: the case of euro area Quantitative Easing

Premium Partner