Financial practitioners pay close attention to economic news because it signals stock trends: past prices already reflect past information, while the latest news drives new price movements. Practitioners therefore need to extract positive or negative signals from the latest news in time to make decisions, and sentiment analysis models can extract this information automatically. However, financial sentiment analysis is challenging because domain-specific language resources and large-scale labeled datasets are scarce. Mishev et al. [8] conducted comprehensive research on NLP-based financial sentiment analysis methods, covering a range of approaches from dictionary-based methods through word and sentence encoders to transformer models. Compared with the other evaluated methods, transformers show excellent performance. The main driver of improved sentiment-analysis accuracy is the text representation method, which feeds the semantics of words and sentences into the model.
Kaliyar et al. [
9] studied the bi-directional BERT model. Unlike other word-embedding models, BERT is built on a bidirectional formulation and uses a transformer-encoder architecture to compute word embeddings. Compared with a BiLSTM on top of a transformer encoder, BERT is a more powerful feature extractor, but on larger corpora it has longer training and inference times and larger memory requirements. These practical problems can be alleviated by designing fine-tuned BERT models in future work. For small datasets, the performance improvement from BERT is more noticeable, which suggests that using pre-trained networks such as BERT may be critical to achieving good performance on such context-dependent tasks.
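As a rough illustration of this kind of fine-tuning, the sketch below attaches a classification head to a pre-trained BERT checkpoint using the Hugging Face transformers library; the checkpoint name, the two-class setup, and the toy data are illustrative assumptions rather than the setup used in the cited work.

```python
# Minimal sketch: fine-tuning a pre-trained BERT classifier on a small
# labeled dataset (model name, label count, and data are illustrative).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # e.g., positive / negative

texts = ["Shares rallied after strong earnings.", "The firm missed guidance."]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few epochs are usually enough for fine-tuning
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
```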
As mentioned earlier, the BERT model has a very good performance in natural language processing tasks. However, in actual tasks, the BERT model requires a lot of computing power. Sun et al. [
10] proposed a patient knowledge distillation (Patient-KD) method that compresses a large-scale BERT model into a smaller one, alleviating the computational cost of the full-scale model. Patient-KD performs multi-layer distillation, enabling the student model to absorb the knowledge embedded in the teacher network rather than only its final output. They validated the model's effectiveness and utility on a battery of natural language processing tasks.
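A minimal sketch of the soft-label distillation objective that approaches like Patient-KD build on is shown below; the temperature and weighting are illustrative assumptions, and the "patient" layer-wise term that matches intermediate hidden states is omitted.

```python
# Minimal sketch of a distillation loss: the student matches the teacher's
# softened output distribution while also fitting the hard labels.
# (Temperature T and weight alpha are illustrative; Patient-KD additionally
# adds a layer-wise loss between intermediate hidden states.)
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                         # rescale to keep gradients comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```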
In the initial phase, pre-training models have gained substantial traction in the realm of natural language processing tasks. However, the extensive adoption of large-scale models has also brought about challenges related to real-time processing and computational constraints. Addressing these concerns, Sanh et al. [
11] introduced DistilBERT, an enhanced iteration of the BERT model. DistilBERT has fewer parameters, trains faster, and largely preserves model performance. Their work demonstrated the viability of training a general-purpose language model through distillation and analyzed the contribution of different components via ablation studies.
Bi-directional attention learning can greatly help self-attention networks (SAN), such as the BERT and XLNet models. Song et al. [
12] proposed Speech-XLNet, an XLNet-like pre-training scheme for unsupervised acoustic model pre-training that learns speech representations for SANs. The parameters of the pre-trained SAN were then adjusted and updated within a hybrid SAN/HMM framework. They conjecture that shuffling the order of speech frames lets the permutation mechanism in Speech-XLNet act as a strong regularizer, encouraging the SAN to use its attention weights to focus on global structure. In addition, Speech-XLNet learns speech representations by exploiting context. Various experiments show that Speech-XLNet is better than the XLNet model in both training efficiency and performance.
Effectively identifying trends in human emotions in social media can play an important role in maintaining personal emotional health and collective social health. Alshahrani et al. [
13] fine-tuned the XLNet model to predict the sentiment of Twitter messages at both the individual-message level and the user level. Because the XLNet model captures context collectively and uses multi-head attention to compute contextual representations, their method substantially advances the state of the art on the benchmark dataset. In addition, a deep consensus algorithm further improves accuracy significantly.
Compared with static word embeddings, contextualized word representations perform better in many NLP tasks. But how context-specific are the representations that models such as BERT produce? Ethayarajh et al. [
14] studied how words are represented in natural contexts. Their investigation revealed that the upper layers of BERT and analogous models generate notably more context-specific representations than the lower layers, and this heightened contextual specificity consistently coincides with an increased degree of anisotropy.
Klein and Moin [
15] proposed a simple and effective question generation method. They couple GPT-2 and BERT in an end-to-end trainable approach to support semi-supervised learning. The questions generated by this method are of high quality and exhibit a high degree of semantic similarity. Their experiments also show that the proposed method can generate questions while greatly reducing the burden of full annotation. Word embeddings conditioned on bidirectional context make the pre-trained BERT perform better on question answering tasks. In addition, since BLEU and similar scores are weak metrics for evaluating generation ability, they recommend using BertQA as an alternative metric for assessing the quality of question generation.
The BERT model performs very well on many NLP tasks, and beyond the English version there are versions for many other languages. Studies have found that a BERT model trained on a single language outperforms a BERT model trained on multiple languages, so training a language-specific BERT model is highly beneficial for NLP tasks in that language. Delobelle et al. [
16] proposed RobBERT, a new Dutch language model based on RoBERTa, and showed across different NLP tasks that it outperforms other BERT-based language models. They also found that RobBERT performs especially well on smaller datasets.
Chernyavskiy et al. [
17] proposed a system developed specifically for SemEval-2020 Task 11 on detecting propaganda techniques in news articles. Their model is based on the RoBERTa architecture, and the final model is obtained by ensembling the models trained on the individual subtasks.
In work [
18], Polignano et al. proposed ALBERTo, an Italian language understanding model trained on hundreds of millions of Italian tweets. After pre-training, the model is fine-tuned on specific Italian tasks, where it outperforms other models.
The research of Moradshahi et al. [
19] shows that the BERT model does not transfer knowledge well across different NLP tasks. They therefore proposed HUBERT, a modified version of BERT that separates symbols from roles in the BERT representation by adding a decomposition layer on top of the model. The HUBERT architecture uses tensor-product representations, in which the representation of each word is constructed by binding two separate attributes together. In extensive empirical studies, HUBERT shows consistent improvements in knowledge transfer across various language tasks.
Wu et al. [
20] proposed two methods to identify offensive language in social media: the first uses supervised classification, and the second uses different pre-training methods to generate different data. They also performed careful preprocessing, translating emoji into words, and then used the BERT model for identification.
Gao et al. [
21] studied feature-engineering models, building on related work on embedding-based neural networks, and explored combining the BERT model with deep neural networks. They then proposed TD-BERT models in several variants and compared their performance with other methods on different NLP tasks. The results show that the TD-BERT model performs best. The experiments also show that the complex neural networks that achieve good performance with static embeddings do not pair well with BERT, whereas incorporating target information into BERT consistently improves performance.
In this work, González-Carvajal et al. [
22] compared the BERT model with traditional machine learning methods in many respects. The traditional machine-learning NLP approach trains a model on features produced by the TF-IDF algorithm. The article compares and analyzes four text-classification experiments, in each case using two different classifiers: BERT and a traditional classifier.
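For concreteness, a minimal sketch of the traditional TF-IDF baseline appears below, using scikit-learn; the toy data and the choice of logistic regression as the classical classifier are assumptions for illustration, not the exact setup of the cited comparison.

```python
# Minimal sketch of the classical TF-IDF baseline (data and classifier choice
# are illustrative): vectorize the text with TF-IDF and fit a linear model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["revenue beat expectations", "the company issued a profit warning"]
labels = [1, 0]  # 1 = positive, 0 = negative

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["guidance was raised again"]))
```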
Baruah et al. [
23] use classifiers based on BERT, RoBERTa, and SVM to detect aggression in English, Hindi, and Bengali texts. Their SVM classifier performed very well on the test set, ranking second in the official results in three of the six evaluations and fourth in another. However, a closer analysis shows that while the SVM classifier achieves better overall scores, the BERT-based classifier is better at predicting minority classes. They also found that their classifiers did not correctly handle spelling errors and deliberate misspellings; fastText word embeddings handle such orthographic variation better.
Lee et al. [
24] trained KR-BERT, a Korean version of the BERT model, using a small corpus. Because Korean has distinctive linguistic properties and relatively few corpora, building a BERT-style language representation for it is particularly important. To this end, they compared different tokenizers and gradually reduced the minimum token granularity to build a better vocabulary for their model. With these modifications, the proposed KR-BERT model achieves good results even with a small corpus.
In this paper [
25], Li et al. compared the BERT and XLNet models, focusing on their computational characteristics. They found two things: the two models have broadly similar computational profiles, and XLNet additionally uses relative position encoding, which on modern CPUs costs about 1.2 times the arithmetic operations and 1.5 times the execution time, a price at which it obtains better benchmark scores.
As multiple geographic locations are involved, the data is inherently multilingual, leading to frequent code-mixing. Sentiment analysis of the code-mixed text can provide insights into popular trends in different regions, but it is a challenge due to the non-trivial nature of inferring the semantics of such data. In this paper [
26], the authors used the XLNet framework to address these challenges. They fine-tuned the pre-trained XLNet model on the available data without any additional pre-processing.
Ekta et al. [
27] propose a method for studying machine reading comprehension. The method is trained with eye-tracking data and studies the connection between visual attention and neural attention. They show, however, that this connection does not hold for the XLNet model, even though XLNet performs best on this difficult task. Their results suggest that the neural attention strategies learned by different architectures differ considerably, and that similarity between neural and human attention does not guarantee optimal performance.
Natural language processing technology has been widely used in real life, but models such as BERT and RoBERTa consume large amounts of computing resources. Iandola et al. [
28] observed that grouped convolutions improve the efficiency of computer vision networks, so they applied this technique in SqueezeBERT. Experiments show that it is 4.3 times faster than BERT.
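The following sketch illustrates why grouped convolutions are cheaper than the dense position-wise layers they replace; the hidden size and group count are illustrative and do not reproduce the actual SqueezeBERT configuration.

```python
# Minimal sketch of the grouped-convolution idea (sizes and group count are
# illustrative): a grouped 1-D convolution has far fewer parameters than an
# equivalent dense layer applied position-wise.
import torch.nn as nn

hidden = 768
dense = nn.Conv1d(hidden, hidden, kernel_size=1, groups=1)     # acts like a Linear layer
grouped = nn.Conv1d(hidden, hidden, kernel_size=1, groups=4)   # roughly 4x fewer weights

print(sum(p.numel() for p in dense.parameters()))    # ~590k parameters
print(sum(p.numel() for p in grouped.parameters()))  # ~148k parameters
```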
The BERT model has a good performance in several NLP tasks. However, its performance in certain professional fields is limited. Therefore, Chalkidis et al. [
29] found that applying BERT in a professional field requires one of the following steps: further pre-training the BERT model on a domain-specific corpus to adapt it, or pre-training a BERT model from scratch on the domain-specific corpus.
Lee et al. [
30] used the BERT model to obtain word embeddings and then processed the text by integrating a bidirectional LSTM with an attention mechanism. This integrated method reaches an accuracy of 0.84.
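A minimal sketch of this kind of architecture, under the assumption that BERT is used as a fixed embedder feeding a bidirectional LSTM with a simple additive attention layer, might look as follows (layer sizes and class count are illustrative, not the cited configuration):

```python
# Minimal sketch (architecture and sizes are illustrative): frozen BERT
# supplies token embeddings, and a bidirectional LSTM with an attention
# layer pools them into a single vector for classification.
import torch
import torch.nn as nn
from transformers import BertModel

class BertBiLSTMAttention(nn.Module):
    def __init__(self, hidden=128, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                      # BERT used as a fixed embedder
            embeddings = self.bert(input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        states, _ = self.lstm(embeddings)          # (batch, seq, 2*hidden)
        scores = self.attn(states).squeeze(-1)     # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        pooled = (weights * states).sum(dim=1)     # attention-weighted sum
        return self.classifier(pooled)
```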
Bashmal et al. [
31] also used an ensemble learning method based on the BERT model. After preprocessing Arabic tweets, they encode the emoticons, and by ensembling a BERT model with an improved BERT model for sentence processing they ultimately obtain a high accuracy.
The transformer model has achieved great results in many NLP tasks. However, it has many parameters and requires a lot of memory and computing resources, so building a smaller and faster model has become an important problem. Nagarajan et al. [
32] proposed a new method to reduce the size of the transformer model, using approximate computation to run on simpler computing resources and pruning unimportant weights. This makes the model faster with only a small loss of accuracy.
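As an illustration of the weight-pruning side of this idea, the sketch below zeroes out the smallest-magnitude weights in each linear layer; the pruning fraction is an assumption, and this is not the exact approximate-computing scheme of the cited work.

```python
# Minimal sketch of magnitude-based weight pruning (threshold fraction is
# illustrative): the smallest-magnitude weights in each linear layer are
# zeroed out, trading a little accuracy for a sparser, cheaper model.
import torch
import torch.nn as nn

def prune_linear_layers(model: nn.Module, fraction: float = 0.3) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            weights = module.weight.data
            k = int(weights.numel() * fraction)
            if k == 0:
                continue
            threshold = weights.abs().flatten().kthvalue(k).values
            mask = weights.abs() > threshold       # keep only the larger weights
            weights *= mask.to(weights.dtype)
```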
There are generally two methods of normalization in neural networks, layer normalization and batch normalization. Shen et al. [
33] explained why the transformer model uses layer normalization rather than batch normalization, and then proposed a power normalization method that achieves better results.
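The contrast between the two normalization schemes can be made concrete with a small sketch (tensor shapes are illustrative): layer normalization computes statistics per token over the hidden features, while batch normalization computes them over the batch, which makes it sensitive to batch composition.

```python
# Minimal sketch contrasting the two normalization schemes on a batch of
# token representations shaped (batch, seq_len, hidden).
import torch

x = torch.randn(8, 16, 64)  # (batch, seq_len, hidden), illustrative sizes

# Layer norm: per-token statistics over the hidden dimension.
layer_mean = x.mean(dim=-1, keepdim=True)
layer_std = x.std(dim=-1, keepdim=True, unbiased=False)
x_layernorm = (x - layer_mean) / (layer_std + 1e-5)

# Batch norm: per-feature statistics over batch and sequence positions,
# which makes it sensitive to batch composition at training time.
batch_mean = x.mean(dim=(0, 1), keepdim=True)
batch_std = x.std(dim=(0, 1), keepdim=True, unbiased=False)
x_batchnorm = (x - batch_mean) / (batch_std + 1e-5)
```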
While the transformer model has demonstrated proficiency in addressing numerous natural language processing challenges, fine-tuning the model remains an intricate endeavor. In their work, Li et al. [
34] introduced a visualization framework aimed at providing researchers with an intuitive means of obtaining feedback during parameter adjustments. This framework enhances clarity during the model’s fine-tuning phase by offering researchers a more transparent view of its behavior and performance.
The BERT model based on the transformer model is also applied in the medical field. Electronic health records are often combined with deep learning models to predict patient conditions. Inspired by this, Rasmy et al. [
35] proposed the Med-BERT model, pre-trained on patients' electronic health record data. Their experiments found that Med-BERT achieves higher accuracy on clinical prediction tasks for patients.
With the development of the Internet, it has become easier for people to obtain news and information, but misinformation and fake news have also proliferated. As a consequence, Schütz et al. [
36] harnessed multiple pre-trained transformer-based models for the purpose of identifying fake news. Their empirical findings underscore the robust capability of transformer models in effectively discerning and detecting fake news.
As the Internet continues to evolve, the prevalence of social media platforms has surged, with a substantial portion of content comprising satire. Identifying satirical language poses a unique challenge due to its distinctive nature. In response, Potamias et al. [
37] introduced an approach that amalgamates a recurrent neural network and a transformer model to discern satirical language. Empirical results from their study highlight the enhanced performance of their proposed model when applied to the dataset.
The BERT model released by Google is trained on an English corpus. To apply the BERT model to other languages, one needs to train it on corpora in those languages. Souza et al. [
38] trained a BERT model on a Brazilian Portuguese corpus and obtained good results on downstream tasks. They call the trained model BERTimbau.
The BERT model, which builds on the transformer, performs well on many NLP tasks. González-Carvajal et al. [
39] described why the BERT model performs better than traditional machine learning methods on natural language processing tasks, demonstrating BERT's superiority through a series of experiments.
The BERT model is a pre-trained model based on the transformer model, while the ALBERT model is a lightweight BERT model. Choi et al. [
40] compared the BERT and ALBERT models and then proposed improved versions of each, Sentence-BERT and Sentence-ALBERT. Experiments show that the proposed models perform better than BERT and ALBERT.
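A minimal sketch of the Sentence-BERT idea, assuming mean pooling over a generic BERT checkpoint and cosine similarity between the pooled vectors (the checkpoint and example sentences are illustrative assumptions):

```python
# Minimal sketch of Sentence-BERT-style sentence embeddings: mean-pool the
# token representations, then compare two sentences with cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)           # mean over real tokens

a, b = embed(["The market fell sharply.", "Stocks dropped steeply."])
similarity = torch.cosine_similarity(a, b, dim=0)
```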
Koutsikakis et al. [
41] used Greek corpora to train the BERT model and obtained GREEK-BERT, a model suited to Greek NLP tasks. In natural language processing evaluations, they found that their monolingual GREEK-BERT model outperforms the multilingual M-BERT and XLM-R models. In their research, Hall et al. [
42] conducted an extensive review of NLP models and their applications in the context of COVID-19 research. Their focus was primarily on transformer-based biomedical pretrained language models (T-BPLMs) and the sentiment analysis related to COVID-19 vaccination. The comprehensive review encompassed an analysis of 27 papers, revealing that T-BPLM BioLinkBERT exhibited strong performance on the BLURB benchmark, which involves the integration of document link knowledge and hyperlinking into the pretraining process. Furthermore, the study delved into sentiment analysis, leveraging various Twitter API tools. These analyses consistently depicted a positive sentiment among the general public regarding vaccination efforts against COVID-19. The paper also thoughtfully outlines the limitations encountered during the research and suggests potential avenues for future investigations aimed at enhancing the utilization of T-BPLMs in various NLP tasks related to the pandemic.
Casola et al. [
43] conducted an extensive study on the increasingly popular pre-trained transformers within the NLP community. While these models have showcased remarkable performance across various NLP tasks, their fine-tuning process poses challenges due to a multitude of hyper-parameters. This complexity often complicates model selection and the accurate assessment of experimental outcomes. The authors commence by introducing and detailing five prominent transformer models, along with their typical applications in prior literature, with a keen focus on issues related to reproducibility. One noteworthy observation was the limited reporting of multiple runs, standard deviation, or statistical significance in recent NLP papers. This shortfall could potentially hinder the replicability and reproducibility of research findings. To address these concerns, the authors conducted an extensive array of NLP tasks, systematically comparing the performance of these models under controlled conditions. Their analysis brought to light the profound impact of hyper-parameters and initial seeds on model results, highlighting the models’ relative fragility. In sum, this study underscores the critical importance of transparently reporting experimental details and advocates for more comprehensive and standardized evaluations of pre-trained transformers in the NLP domain.
In a separate vein, Friedman et al. [
44] introduce a transformer-based NLP architecture designed to extract qualitative causal relationships from unstructured text. They underscore the significance of capturing diverse causal relations for cognitive systems operating across various domains, ranging from scientific discovery to social science. Their paper presents an innovative joint extraction approach encompassing variables, qualitative causal relationships, qualifiers, magnitudes, and word senses, all of which are instrumental in localizing each extracted node within a comprehensive ontology. The authors demonstrate their approach’s effectiveness by presenting promising outcomes in two distinct use cases involving textual inputs from academic publications, news articles, and social media.
In the realm of actuarial classification and regression tasks, Troxler et al. [
45] delve into the utilization of transformer-based models to integrate text data effectively. They offer compelling case studies involving datasets comprising car accident descriptions and concise property insurance claims descriptions. These case studies underscore the potency of transfer learning and the advantages associated with domain-specific pre-training and task-specific fine-tuning. Moreover, the paper explores unsupervised techniques, including extractive question answering and zero-shot classification, shedding light on their potential applications. Overall, the results eloquently demonstrate that transformer models can seamlessly incorporate text features into actuarial tasks with minimal preprocessing and fine-tuning requirements.
Singh and Mahmood [
46] offer a comprehensive overview of the current landscape of state-of-the-art NLP models employed across various NLP tasks to achieve optimal performance and efficiency. While acknowledging the remarkable success of NLP models like BERT and GPT in linguistic and semantic tasks, the authors underscore the significant computational costs associated with these models. To mitigate these computational challenges, recent NLP architectures have strategically incorporated techniques such as transfer learning, pruning, quantization, and knowledge distillation. These approaches have enabled the development of more compact model sizes, which, remarkably, yield nearly comparable performance to their larger counterparts. Additionally, the paper delves into the emergence of Knowledge Retrievers, a critical development for efficient knowledge extraction from vast corpora. The authors also explore ongoing research efforts aimed at enhancing inference capabilities for longer input sequences. In sum, this paper provides a comprehensive synthesis of current NLP research, encompassing diverse architectural approaches, a taxonomy of NLP designs, comparative evaluations, and insightful glimpses into the future directions of the field.
In a separate domain, Khare et al. [
47] present an innovative application of transformer models for predicting the thermal stability of collagen triple helices directly from their primary amino acid sequences. Their work involves a comparative analysis between a small transformer model trained from scratch and a fine-tuned large pretrained transformer model, ProtBERT. Interestingly, both models achieve comparable R2 values when tested on the dataset. However, the small transformer model stands out by requiring significantly fewer parameters. The authors also validate their models against recently published sequences, revealing that ProtBERT surpasses the performance of the small transformer model. This study marks a pioneering endeavor in demonstrating the utility of transformer models in handling small datasets and predicting specific biophysical properties. It serves as a promising stepping stone for the broader application of transformer models to address various biophysical challenges.
Discussion
One year after the transformer model was proposed, the BERT model, which uses the transformer's encoder, gradually became well known and was applied to many NLP tasks. Although BERT performs well across NLP tasks, it is computationally intensive and slow. Therefore, papers [
10,
11] proposed knowledge distillation methods to compress the model. Two strategies are proposed in paper [
10]: in one, the student learns from the last k layers of the teacher model; in the other, the student learns from every k-th layer of the teacher model. In the methodology described in [
11], the approach leverages the shared dimensionality between teacher and student networks. It involves the initialization of the student network from the teacher network by selectively taking one layer out of every two layers in the model.
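A minimal sketch of this initialization strategy, assuming Hugging Face-style BERT modules: a 6-layer student copies every other encoder layer from a 12-layer teacher, which is possible because the two share the same hidden dimensionality.

```python
# Minimal sketch (assumes Hugging Face BERT modules): initialize a 6-layer
# student by copying every other encoder layer from a 12-layer teacher.
import copy
from transformers import BertConfig, BertModel

teacher = BertModel.from_pretrained("bert-base-uncased")        # 12 layers
student_config = BertConfig.from_pretrained("bert-base-uncased",
                                             num_hidden_layers=6)
student = BertModel(student_config)

# Embeddings are copied directly; encoder layers are taken from every
# second teacher layer (0, 2, 4, ...).
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.encoder.layer[student_idx] = copy.deepcopy(
        teacher.encoder.layer[teacher_idx])
```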
Not only is the encoder part of the transformer widely used in NLP tasks; the GPT model, built on the transformer's decoder, also performs well. In addition, the RoBERTa model, which builds on BERT, and the XLNet model, which improves on BERT, also perform well. Papers [
8,
14,
22,
23] compare several models. Among them, paper [
8] has two contributions. The first is the use of these models for sentiment analysis of financial news, for which there has been very little prior work. The second is an extensive set of comparison experiments covering many different text representation methods and machine-learning classifiers. Paper [14] mainly compares the geometry of BERT, ELMo, and GPT-2 embeddings; by analyzing the vectors corresponding to words at different layers, it characterizes how the embedding representations of the three models differ. In the article [22], the authors compare, through several experiments, the classification performance of traditional machine learning using vocabulary features extracted with TF-IDF against the BERT model. Four experiments add empirical evidence for BERT's superiority over classical methodologies on typical NLP problems. In the article [23], the authors compare BERT, RoBERTa, and SVM models in three languages. Interestingly, the best-performing model in that article is the SVM, which shows that traditional machine learning methods can still surpass transformer models. The article also highlights the importance of data preprocessing, because misspelled words produce erroneous word embeddings and thus incorrect predictions.
Some problems cannot be solved by a single model such as BERT or XLNet alone, so papers [
13,
15,
21] proposed some solutions combining transformer models with other machine learning methods. In paper [
13], they used the XLNet model combined with deep consensus for sentiment analysis, a combination that further improves the accuracy of the model. In paper [
15], the article studies question generation and question answering, combining the GPT-2 model with the BERT model; the combination better exploits the synergy between generating questions and answering them. In paper [
17], RoBERTa is used as the main model, but an additional CRF layer is added and training is performed on two tasks. The results show that this combination is better than using RoBERTa alone. In the article [
21], the authors propose the TD-BERT model, which resembles adding a fully connected classification network on top of BERT; the difference is that TD-BERT adds a max-pooling layer after BERT, which allows the classifier to make better use of target-location information.
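A minimal sketch of this target-pooling idea follows, with the shapes and the target-mask convention as illustrative assumptions rather than the exact TD-BERT implementation:

```python
# Minimal sketch of the TD-BERT-style idea: max-pool the BERT outputs at the
# target's token positions and classify the pooled vector, so the target's
# location is used directly.
import torch
import torch.nn as nn
from transformers import BertModel

class TargetPooledClassifier(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, target_mask):
        # target_mask: 1 at the target's token positions, 0 elsewhere.
        hidden = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        masked = hidden.masked_fill(target_mask.unsqueeze(-1) == 0,
                                    float("-inf"))
        pooled, _ = masked.max(dim=1)          # max over the target positions
        return self.classifier(pooled)
```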
The BERT model is trained on a very large English corpus, but it is not well suited to NLP tasks in other languages. In papers [
16,
18,
24,
26], other language models based on the BERT model have been established. Among them, paper [
16] trained on a large Dutch corpus to obtain the Dutch RobBERT model, and paper [18] established the ALBERTo model for Italian NLP tasks. Both models perform very well on NLP tasks in their respective languages. Since Korean is a morphologically rich language that uses a non-Latin alphabet and has limited resources, capturing language-specific linguistic phenomena is also important. In the article [24], the authors proposed the KR-BERT model for Korean NLP tasks; it is trained on a smaller corpus, which shortens training time and makes it more efficient. On social media, people use a great deal of informal language, producing code-mixed text that blends two languages, and such sentences are a major obstacle to sentiment analysis. In the article [26], the XLNet model was used to address this problem, but it did not perform well. For such code-mixed language, the standard pre-training corpora may no longer be adequate, so traditional machine learning methods or alternative data preprocessing strategies are worth trying. The attention mechanism is a very important part of the transformer model, so is this mechanism similar to human attention? Paper [27] offers an answer: higher similarity between model attention and human attention is significantly related to performance for LSTM and CNN models, but not for the XLNet model. XLNet achieved the best performance, which shows that resembling human attention does not guarantee the best performance and suggests that the attention strategies learned by different architectures can differ considerably from human reading behavior.
Several works focus on improving the efficiency of transformer models, including Iandola et al. [
28] with their SqueezeBERT approach and Nagarajan et al. [
32] with their use of approximate computing and pruning. Work on adapting the BERT model to specific tasks or domains through pre-training and fine-tuning includes Chalkidis et al. [
29] with their domain-specific pre-training approach and Rasmy et al. [
35] with their work on medical text. Other papers apply the BERT model to tasks such as sentiment analysis, fake news detection, and satire detection: Lee et al. [30] combine BERT embeddings with a bidirectional LSTM and attention, Bashmal et al. [31] work on Arabic sentiment analysis, Schütz et al. [36] take a transformer-based approach to fake news detection, and Potamias et al. [37] use recurrent and convolutional neural networks for satire detection. Further papers focus on comparative analysis and model development, such as Souza et al. [
38] with their BERTimbau approach for Brazilian Portuguese, González-Carvajal et al. [
39] with their comparison of BERT to traditional machine learning models, Choi et al. [
40] with their comparative study of BERT variants, and Koutsikakis et al. [
41] with their work on GREEK-BERT for Greek NLP tasks.
Hall et al. [
42] review the use of transformer-based biomedical pretrained language models in COVID-19 research, while Casola et al. [43] stress the importance of reporting experimental details and of standardized evaluation for reproducibility. Friedman et al. [
44] present a joint extraction approach to extracting qualitative causal relationships from unstructured text, which has important implications for cognitive systems and graph-based reasoning. Troxler et al. [
45] explore the use of transformer-based models to incorporate text data into actuarial classification and regression tasks. Singh and Mahmood [
46] provide a comprehensive overview of current NLP research, including different architectures, a taxonomy of NLP designs, comparative evaluations, and future directions in the field. Khare et al. [
47] demonstrate the potential of transformer models in predicting the thermal stability of collagen triple helices directly from their primary amino acid sequences.