For training the neural translation models, navigation sentence pairs were tokenized and transformed into standard word embeddings to encode latent semantic information. Training was performed in a leave-one-out cross-validation setup. The left-in datasets were merged and randomly re-split into training and validation batches at a 9:1 ratio, resulting in ~3300 and ~300 sentence pairs, respectively. The source sentences were augmented with random swap and random deletion operations, as described by [26], to further improve model generalization.
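A minimal sketch of how such token-level augmentation could be applied to a tokenized source sentence is given below; the operation count, deletion probability, and example tokens are illustrative assumptions rather than the exact settings used here.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Randomly swap the positions of two tokens, repeated n_swaps times."""
    tokens = tokens.copy()
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

# Hypothetical tokenized source navigation sentence
source = ["advance", "instrument", "to", "target", "position"]
augmented = random_deletion(random_swap(source, n_swaps=1), p=0.1)
```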
As baseline models for predicting navigation steps at class level (Fig. 2e), a first-order hidden Markov model (HMM) with 12 hidden states and Gaussian emission distributions, as well as a two-layered long short-term memory (LSTM) model with 200 neurons, were chosen (Fig. 3a and b).
The S2S and LSTM models converged reasonably fast on this dataset. Both were trained over ten epochs with randomly assembled and shuffled data batches of size \( b = 20 \) and a cross-entropy loss criterion. The LSTM input sequence length was set to \( n = 6 \) steps. Batch assembly and shuffling were re-initialized for each epoch to avoid exposing the models to repetitive, step-like input patterns.
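A condensed sketch of this per-epoch batching and cross-entropy training scheme is shown below; the model, dataset, and optimizer objects are placeholders, the model is treated as a single-step classifier, and validation is omitted for brevity.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

criterion = nn.CrossEntropyLoss()

def train(model, dataset, optimizer, epochs=10, batch_size=20):
    """Train with freshly shuffled batches of size b=20 in every epoch."""
    for epoch in range(epochs):
        # Re-creating the DataLoader each epoch re-assembles and re-shuffles
        # the batches, so repetitive step-like batch orderings are avoided.
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for src, tgt in loader:
            optimizer.zero_grad()
            logits = model(src)              # (batch, n_classes)
            loss = criterion(logits, tgt)    # tgt: target class indices
            loss.backward()
            optimizer.step()
```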
The TRF model converged comparatively slowly and was trained with the same batch preparation over 40 epochs using the Kullback–Leibler divergence
$$ \begin{aligned} D_{KL}\left( P_{\text{Truth}} \,\|\, P_{\text{Pred}} \right) & = \sum_{y} P_{\text{Truth}}(y)\,\log\!\left( \frac{P_{\text{Truth}}(y)}{P_{\text{Pred}}(y')} \right) \\ & \equiv H\left( P_{\text{Truth}}, P_{\text{Pred}} \right) - H\left( P_{\text{Truth}} \right) \end{aligned} $$
for discrete probability functions as the loss criterion.
\( P_{\text{Truth}}(y) \) and \( P_{\text{Pred}}(y') \) are the ground truth and predicted probability distributions over the labels \( y \) and \( y' \) used in the target navigation sentences. The criterion measures the inefficiency of approximating \( P_{\text{Truth}}(y) \) with \( P_{\text{Pred}}(y') \) and forces the latent variables during training to resemble \( P_{\text{Truth}}(y) \).
Additionally, for the longer transformer training process, a label smoothing regularization (smoothing factor = 0.1) was applied to the prediction outputs, as described by [25]. This regularization penalizes predicted labels with high confidence by assigning reduced confidence to the target label scores, with the intention of preventing over-fitting and improving generalization.
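A compact sketch of how a smoothed target distribution can be combined with the KL-divergence criterion in PyTorch is given below; the smoothing value matches the one stated above, while the class count, batch size, and padding handling are simplifying assumptions.

```python
import torch
from torch import nn

def smoothed_targets(labels, n_classes, smoothing=0.1):
    """Turn hard class indices into a smoothed target distribution P_Truth."""
    confidence = 1.0 - smoothing
    targets = torch.full((labels.size(0), n_classes), smoothing / (n_classes - 1))
    targets.scatter_(1, labels.unsqueeze(1), confidence)
    return targets

# KLDivLoss expects log-probabilities for the prediction P_Pred
criterion = nn.KLDivLoss(reduction="batchmean")

logits = torch.randn(20, 12)           # batch of b=20, 12 label classes (illustrative)
labels = torch.randint(0, 12, (20,))
loss = criterion(torch.log_softmax(logits, dim=-1),
                 smoothed_targets(labels, n_classes=12))
```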
In all neural network training cases, model weights were updated using an Adam optimizer (\( \beta_{1,2} = (0.9, 0.98) \), \( \varepsilon = 10^{-9} \)) with a warm-up phase and the learning rate
$$ lr = \sqrt{d_{\text{model}}} \cdot \min\left( n_{\text{step}}^{-0.5},\; n_{\text{step}} \cdot n_{\text{warmup}}^{-1.5} \right) $$
\( lr \) grows linearly until the warm-up step count \( n_{\text{warmup}} = 200 \) is reached and decreases proportionally to \( n_{\text{step}}^{-0.5} \) thereafter.
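A minimal sketch of this warm-up schedule, for example wrapped in a PyTorch LambdaLR scheduler, might look as follows; the model dimension and optimizer are placeholders, and the scaling constant follows the formula as stated above (the canonical transformer schedule instead scales by \( d_{\text{model}}^{-0.5} \)).

```python
import torch

def warmup_lr(step, d_model, n_warmup=200):
    """Learning rate that grows linearly up to n_warmup and then decays with step**-0.5."""
    step = max(step, 1)  # avoid 0**-0.5 at the very first call
    return (d_model ** 0.5) * min(step ** -0.5, step * n_warmup ** -1.5)

# Hypothetical usage with the Adam settings described above:
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lr_lambda=lambda step: warmup_lr(step, d_model=512))
```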
All training and prediction tasks were performed using PyTorch v1.3.1 [27]. The S2S and TRF model trainings finished after 0.5 h and 8.2 h, respectively, at an average processing speed of ~500 tokens per second. The LSTM model training finished after 0.4 h. Computations were performed with CUDA 10.1 on an NVIDIA GeForce RTX 2070S graphics card. After training, the models exhibited mean losses of \( \bar{l}_{GRU} = 0.282 \), \( \bar{l}_{TRF} = 0.319 \), and \( \bar{l}_{LSTM} = 0.262 \). The HMM was fitted over 500 iterations using the Baum–Welch algorithm and reached the convergence threshold of \( \varepsilon = 0.01 \) in less than 60 s of computation time.
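The HMM fitting could, for instance, be reproduced with the hmmlearn package as sketched below; the feature matrix, sequence lengths, and covariance type are assumptions made for illustration only.

```python
import numpy as np
from hmmlearn import hmm

# X: concatenated per-step feature vectors; lengths: length of each workflow sequence
X = np.random.rand(1000, 8)          # placeholder observations
lengths = [100] * 10                 # placeholder workflow lengths

# 12 hidden states, Gaussian emissions, Baum-Welch (EM) for up to 500 iterations,
# stopping once the log-likelihood gain falls below 0.01
model = hmm.GaussianHMM(n_components=12, covariance_type="diag",
                        n_iter=500, tol=0.01)
model.fit(X, lengths)
```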
Sentence translation and prediction tasks
The trained models were then used to predict navigation steps from the left-out navigation workflows; in total, 3827 sentence-pair translations and class predictions were performed. The target navigation sentences were generated one word at a time. For the word candidate search, we implemented an adapted beam search algorithm outlined in [28], and for the context function we used a decaying factor, proposed by [29], to penalize specific beam scores as follows:
$$ s(y) = \log p\left( y \mid x \right) \cdot \left( 1 - d(y) \right) \quad \text{and} \quad d(y) = 1 - e^{-\frac{r_{y}}{r_{y,\text{mean}}}} $$
Here, the score \( s(y) \) of the next word candidate \( y \) in the target sentence is calculated as the word candidate's log-likelihood degraded by a decaying factor \( d(y) \). This factor is defined as an exponential decay function, where \( r_{y} \) is the current number of subsequent recurrences of the word candidate over the series of navigation sentences and \( r_{y,\text{mean}} \) is the mean number of subsequent recurrences of the word candidate over all sentences in the training dataset. The intuition behind this rescoring is that a word candidate's likelihood is penalized when it recurs more often than the expected mean number of times. The ratio induces a delayed termination of the likelihood of specific word candidates and thereby enables other word candidates to be preferred during sentence decoding. Due to our smaller vocabulary size, we used a beam size of 4 instead of the 16 or 32 beams typical for more complex text processing tasks.
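A minimal sketch of this recurrence-based rescoring, applied to the candidates of a single beam expansion, could look as follows; the function names, candidate tuples, and recurrence counts are illustrative and not the exact implementation.

```python
import math

def decay(r_y, r_y_mean):
    """Exponential decay factor d(y) driven by the recurrence ratio r_y / r_y,mean."""
    return 1.0 - math.exp(-(r_y / r_y_mean))

def rescore(log_prob, r_y, r_y_mean):
    """Rescored beam score s(y) = log p(y|x) * (1 - d(y))."""
    return log_prob * (1.0 - decay(r_y, r_y_mean))

# Illustrative candidates from one beam expansion:
# (word, log p(y|x), current recurrences r_y, mean recurrences r_y,mean)
candidates = [("hold", -0.2, 5, 2.0), ("advance", -0.9, 0, 1.5)]
scores = {word: rescore(lp, r, rm) for word, lp, r, rm in candidates}
```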
Predicted and ground truth sentences were then analyzed using the similarity metrics BLEU-1 and ROUGE-L (recall), as proposed in [30], to reflect translation precision and sentence-level structure recall. Based on these scores, we adopted an intermediate \( F_{1} \)-score:
$$ F_{1,BR} = 2 \cdot \frac{\text{BLEU-1} \cdot \text{ROUGE-L}}{\text{BLEU-1} + \text{ROUGE-L}} $$
Since both scores are based on n-gram matching, \( F_{1,BR} \) approximates the accuracy of a translation in producing the correct stepwise n-grams. For the sequence-to-sequence models, the predictive power was assessed using a position-specific accuracy for the classification of the correct words, derived from the classification confusions. The word-level Jaccard distance was included to assess dissimilarity. For the baseline model predictions, precision and recall values were calculated.
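As an illustration, the sketch below computes BLEU-1 (here reduced to clipped unigram precision), ROUGE-L recall (longest common subsequence length over the reference length), and their harmonic mean \( F_{1,BR} \) for a single sentence pair; library implementations of the metrics would serve equally well, and the token lists are hypothetical.

```python
from collections import Counter

def bleu_1(pred, ref):
    """Clipped unigram precision of the predicted tokens against the reference."""
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return overlap / max(len(pred), 1)

def rouge_l_recall(pred, ref):
    """LCS length divided by reference length (ROUGE-L recall)."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if pred[i] == ref[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(n, 1)

def f1_br(pred, ref):
    """Harmonic mean of BLEU-1 and ROUGE-L recall."""
    b, r = bleu_1(pred, ref), rouge_l_recall(pred, ref)
    return 2 * b * r / (b + r) if (b + r) > 0 else 0.0

# Hypothetical predicted vs. ground truth navigation sentence
print(f1_br(["advance", "to", "target"], ["advance", "probe", "to", "target"]))
```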