Sequence-to-sequence (seq2seq) methods using encoder-decoder schemes are a popular choice for several tasks such as machine translation, text summarization and question answering [88]. However, the encoder’s contextual representations become uncertain when dealing with long-range dependencies. To address this drawback, Vaswani et al. [92] introduced a novel network architecture, called the transformer, which relies entirely on self-attention units to map input sequences to output sequences without the use of RNNs. The transformer’s decoder unit contains a masked multi-head attention layer, followed by a multi-head attention unit and a feed-forward network, whereas the encoder unit is almost identical but lacks the masked attention unit. Multi-head self-attention layers are computed in parallel, which addresses the computational costs of the regular attention layers used by previous seq2seq network architectures.

In [
17] the authors presented BERT (bidirectional encoder representations from transformers), a model that builds on findings from various previous studies (e.g., [14, 38, 73, 77, 92]) and achieved state-of-the-art results on eleven NLP tasks. The BERT training process is split into two phases: an unsupervised pre-training phase and a fine-tuning phase that uses labeled data for downstream tasks. In contrast with previously proposed models (e.g., [
73, 77]), BERT uses masked language models (MLMs) to enable pre-trained deep bidirectional representations. In the pre-training phase, the model is trained on a large amount of unlabeled data from Wikipedia and BookCorpus [104], using WordPiece [98] embeddings. In this phase, the model is trained on two tasks: in the first, 15% of the input tokens are randomly masked and the model aims to capture conceptual representations of word sequences by predicting the masked words inside the corpus, whereas in the second, the model is given two sentences and tries to predict whether the second sentence follows the first. In the second phase, BERT is extended with a task-related classifier model that is trained in a supervised manner. During this supervised phase, the pre-trained BERT model receives minimal changes, with the classifier’s parameters trained to minimize the loss function. Two models were presented in [
17], a “Base BERT” model with 12 encoder layers (i.e., transformer blocks), feed-forward networks with 768 hidden units and 12 attention heads, and a “Large BERT” model with 24 encoder layers, feed-forward networks with 1024 hidden units and 16 attention heads. The pre-trained BERT model has an architecture almost identical to the aforementioned transformer network. A [CLS] token is supplied as the first token of the input, and its final hidden state is used as the aggregate representation for classification tasks.

Despite the achieved breakthroughs, the BERT model suffers from several drawbacks. First, BERT, like all language models using transformers, assumes independence between the masked words of the input sequence and neglects all the positional and dependency information between words. In other words, for the prediction of a masked token both word and position embeddings are masked out, even though positional information is a key aspect of NLP [
15]. In addition, the [MASK] token, which is substituted for the masked words, is mostly absent in the fine-tuning phase for downstream tasks, leading to a discrepancy between pre-training and fine-tuning. To address these drawbacks of BERT, a permutation language model called XLNet was introduced. It is trained to predict masked tokens in a non-sequential random order, factorizing the likelihood in an autoregressive manner without the independence assumption and without relying on any input corruption [
100]. In particular, a query stream is used that extends embedding representations to incorporate positional information about the masked words. The original representation set (content stream), including both token and positional embeddings, is then used as input to the query stream, following a scheme called “Two-Stream Self-Attention”. To overcome the problem of slow convergence, the authors propose predicting only the last tokens of the permuted sequence instead of the entire sequence. Finally, XLNet also uses special tokens for the classification and separation of the input sequence, [CLS] and [SEP], respectively; however, it also learns an embedding that denotes whether two words are from the same segment. This is similar to the relative positional encodings introduced in Transformer-XL [
15], and extends the ability of XLNet to cope with tasks that encompass arbitrary input segments.

Recently, a replication study [
59] suggested several modifications to the training procedure of BERT, with which the resulting model outperforms the original XLNet architecture on several NLP tasks. The optimized model, called robustly optimized BERT approach (RoBERTa), used 10 times more data (160 GB compared with the 16 GB originally exploited) and is trained for far more training steps than the BERT model (500 K vs. 100 K), also using 8 times larger batch sizes and a byte-level BPE vocabulary instead of the character-level vocabulary that was previously utilized. Another significant modification was the use of dynamic masking instead of the single static mask used in BERT. In addition, the RoBERTa model removes the next sentence prediction (NSP) objective used in BERT, following the advice of several other studies that question the NSP loss term [44, 55, 101].
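
To make the difference between the masked and the regular attention units of the transformer more concrete, the following is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is an illustration under simplifying assumptions rather than the implementation of [92]: the function names `softmax` and `attention` and the `causal` flag are our own, the learned projection matrices are omitted, and a multi-head layer would run several such heads in parallel and concatenate their outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one head (illustrative sketch).

    queries, keys, values: lists of float vectors, one vector per position.
    causal=True hides later positions, mimicking the masked attention unit.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # With a causal mask, position i may only attend to positions 0..i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        outputs.append(
            [sum(w * values[j][c] for w, j in zip(weights, visible)) for c in range(dim)]
        )
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = attention(x, x, x, causal=True)   # each position sees only its past
full = attention(x, x, x)                  # each position sees the whole input
```

With the causal mask, the first position can attend only to itself, so its output equals its own value vector; without the mask, every output mixes information from all positions.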
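
The masked language modeling task and the distinction between static and dynamic masking can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of [17] or [59]: the helper `mask_tokens` and its parameters are hypothetical, every selected token is replaced by [MASK] (the original BERT recipe additionally keeps or randomly replaces a fraction of the selected tokens), and the tokens are toy placeholders instead of WordPiece units.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask roughly mask_prob of the non-special tokens (illustrative sketch).

    Returns (masked_tokens, labels), where labels holds the original token
    at each masked position and None elsewhere.
    """
    rng = rng or random.Random()
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    if not candidates:
        return list(tokens), [None] * len(tokens)
    n_mask = max(1, round(len(candidates) * mask_prob))
    chosen = set(rng.sample(candidates, n_mask))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels


tokens = ["[CLS]"] + ["w%d" % i for i in range(20)] + ["[SEP]"]

# Static masking: the mask is drawn once and reused in every epoch.
static_masked, static_labels = mask_tokens(tokens, rng=random.Random(0))

# Dynamic masking: a fresh mask is drawn each time the sequence is seen.
rng = random.Random(1)
dynamic_views = [mask_tokens(tokens, rng=rng)[0] for _ in range(3)]
```

Under static masking the model sees the same corrupted sequence in every epoch, whereas dynamic masking exposes it to different masked positions of the same sentence over the course of training.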
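
The two-stream scheme of XLNet can likewise be illustrated through the attention masks that a sampled factorization order induces. The sketch below reflects our reading of [100] and is not the authors' code: the function name `permutation_masks` is hypothetical, and we assume that the content stream of a position may attend to itself and to all positions that precede it in the factorization order, while the query stream may attend only to the preceding positions and therefore never sees the content of the token it has to predict.

```python
def permutation_masks(order):
    """Build attention masks for a given factorization order (illustrative sketch).

    order: a permutation of range(n); order[0] is predicted first.
    Returns (content_mask, query_mask), where mask[i][j] is True if
    position i may attend to position j.
    """
    n = len(order)
    # rank[pos] is the step at which position pos appears in the order.
    rank = {pos: step for step, pos in enumerate(order)}
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query


# Factorization order 2 -> 0 -> 3 -> 1 over a four-token input.
content_mask, query_mask = permutation_masks([2, 0, 3, 1])
```

In this example, position 2 comes first in the order, so its query stream has no visible context, while position 1 comes last and its query stream can attend to every other position. Predicting only the positions at the end of the order, as the authors propose, restricts training to targets that have a large visible context.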