Published in: Journal of Big Data 1/2022

Open Access 01.12.2022 | Research

Transforming the generative pretrained transformer into augmented business text writer

Authors: Faisal Khalil, Gordon Pipa


Abstract

This study uses the transformer architecture of artificial neural networks to generate artificial business text for a given topic or theme. The aim of the study is to augment business report writing, and business writing in general, with the help of generative pretrained transformer (GPT) networks. The main focus of the study is to provide a practical use case for GPT models with the help of big data. Our model has 355 million parameters and was trained for three months on GPU-enabled devices using 2.3 billion text tokens (now available as open-source data). The text tokens were collected through rigorous preprocessing, which included shortlisting Subreddits of Fortune 500 companies and industries listed on the US-based social news aggregation portal “Reddit”. After shortlisting, millions of user submissions posted over five years were parsed to collect the URLs they contain, and 1.8 million working URLs were scrutinized. Business text was parsed from these uniform resource locators (URLs), cleaned, and converted into word embeddings. The results show that both models, conditional interactive and random sampling, generate text paragraphs that are grammatically accurate and stick to the given topic.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
NLP
Natural language processing
GPT
Generative Pretrained Transformer
URLs
Uniform resource locator
ANNS
Artificial neural networks
LSTM
Long short term memory
MT
Machine translation
GED
Grammar development environment
CSR
Continuous speech recognition
LVCSR
Large vocabulary continuous speech recognition
HTML
Hypertext Markup Language

Introduction

With the passage of time, the fields of artificial intelligence and machine learning have progressed by leaps and bounds. Nearly all fields benefit from cutting-edge technologies to leverage their processes, and deep learning is one of them. Big tech giants are reformulating their strategies to align with AI and ML. Deep learning is a branch of machine learning that enhances the model learning process with its deep layered architecture. As in many other walks of life, deep learning has won its spurs as a very effective and efficient technique for natural language processing tasks. Since computers are unable to understand natural language, enabling them to understand it and to process the information in a useful fashion has long been a focus of researchers and practitioners.
This study is inspired by the method implemented by the Google Brain team [47] and the work of OpenAI [36]. Before introducing the transformers developed in the above-cited research, it is important to shed light on the recent past of natural language processing (NLP). Although NLP has deep roots in the past and the first breakthrough was Alan Turing's well-known paper “Computing Machinery and Intelligence” [46], real progress in the field was made in the late 1980s, when machine learning algorithms came into the picture. The machine learning revolution permanently changed the approaches used to address NLP problems. At the start, much stress was placed on rich text feature embeddings, to enable artificial neural networks (ANNS) to understand rich text in numerical form. These embeddings were then given to an end-to-end neural network that essentially maps inputs to outputs, e.g. [32]. Later, seminal work was published on recurrent neural networks [40]. Recurrent models are very important for natural language processing because natural language carries lexical, syntactical, and semantic context; thus previous words or characters are very important for solving machine translation and text prediction tasks. In 2002, Jürgen Schmidhuber and his students [18] came up with a better idea for neural network applications that involve long-term dependencies, named long short-term memory (LSTM). LSTM devises a gating and state mechanism that keeps important information from the previous sequence and also memorizes the previous state, which finally accumulates into the current state to predict the next sequence. Many enhancements to the recurrent neural network model have been made by the research community. The most highlighted models are seq2seq (sequence to sequence) models [24, 44]. Seq2seq models essentially work with encoders and decoders recurrently, encoding the output of the previous sequence and combining it with the current input. The next enhancement of recurrent models was the attention mechanism, see [55, 56]. The attention mechanism has proven very effective in machine translation, where pairs of sentences from two languages are mapped together with encoders and decoders.
Looking back at this short history of the evolution of natural language processing techniques, we see one common limitation of all these models: they are hungry for computational resources and very slow. NLP corpora normally involve an enormous amount of training data, long-term dependencies, and a recurrent nature. These factors make the training process very slow to achieve the desired result. Addressing this problem, the research community came up with multilayered attention heads and encoder-decoders, formally called transformers [47]. The current study uses a similar approach to generate domain-specific text; the detailed methodology is discussed in "Methods". We have used the recently developed transformer neural network architecture. This architecture, primarily used for Google translation, works in two different blocks, namely encoders and decoders. We have only used the decoder part. We provided the model with 2.3 billion text tokens during training. The model has 355 million parameters and was trained for 3 months to reach a training loss value of 2.6. The above-mentioned 2.3 billion text tokens were collected after rigorous data preprocessing steps. A US-based social news aggregation and discussion forum was selected for data collection. Almost 700 Subreddits were shortlisted for the purpose of extracting URLs. Millions of submissions over five years were considered; a submission means any post, comment, or reply by a user. Users often point towards URLs for clarification. In this way, 1.8 million URLs were collected from the submissions, and the validity and functionality of all URLs were confirmed. With the help of a parser, these URLs were parsed and cleaned to get the text. Finally, 2.3 billion word embeddings, ready to feed to the model, were generated. In the rest of the paper, the literature review, the methodology of the study and model, the results of the study, and limitations and future suggestions are given, respectively.

Research gap

After this flashback of the evolution and recent developments of NLP, we can see that one common problem for all natural language understanding tasks is creating a relationship matrix between words or characters and giving importance to a specific word at a specific place. Solving this problem is very important for all NLP-related niches, for example natural language understanding, natural language generation, and machine translation. In this connection, we mainly have two problems to solve. Problem one is giving importance to words at specific places in the sentence and creating a correlation or context for each word embedding based on its usage. The second problem is supplying a lot of data, in other words a lot of instances, to the model so that it can learn the placement and relational patterns of characters or words. Supplying a lot of data requires very large word embedding matrices, which leads to extremely slow model training and a lot of computational resources. So, the computational and efficiency problem is the more severe one when seeking a breakthrough on problem one. The research community could either wait for computational resources to become efficient and fast enough to solve the problem at hand, or come up with an optimal solution. The solution to this problem was the attention mechanism [47] and, more specifically, the transformer architecture of neural networks, formally called encoder-decoders [36]. Fair enough, the transformer can theoretically overcome the above-mentioned problems and open a new horizon for NLP and NLG, but we need to provide real-life use cases and proofs of concept to supplement this new ANNS architecture. After this conceptual breakthrough, the next challenge is to come up with a lot of data and to preprocess that big data so it can be supplied to these new models as proof of the conceptual invention. Our paper fills exactly this gap by taking on the challenge of developing the proof of concept and demonstrating the practicality of this new advancement in NLP and deep learning. In this journey the most important step is to find a use case; we have chosen business-related reports and text writing. In the next subsection, we give precise details of where and how this concept can be used in a commercial setting and what benefits it can promise. Coming back to the current point, getting a lot of business-related data is very important as well as very hard, because of the large amount of irrelevant text and the lack of guarantees that a text is authentic business text. Involving human effort to tag data is very costly and not plausible. We therefore decided to use “Reddit”, a widely used platform where each post is voted on by the community. In this way, we could get human-checked data in huge volumes related to business problems. It is also relevant to mention here that we did not parse data from “Reddit” directly; rather, we only collected URL links from the posts and then parsed the complete text behind those URLs. Our main contribution here is therefore less on the theoretical side and more on the practical side, as we have retuned and adapted the existing theoretical concept in a more practical setting to provide its proof of concept. After this discussion, it is relevant to provide one hypothetical application instance and the possible commercial usage of this study. The next subsection presents this hypothetical ideal use case and the overall generic use cases of the study.

Hypothetical use case

Let us create a practical scenario. In office and business management there is a lot of report and text writing; for example, Manager X has to write a job placement ad for a consultancy firm, or an advertisement, or a small report about his product and its competitors in the industry he operates in, in order to get external funding. In such cases grammar is not the only important factor; pinning down the words other people in the industry use, to be more persuasive, or the clarity of the text, may be even more important. Suppose a software application helps Manager X in two ways. First, it suggests context-appropriate word replacements based on millions of cases in which other people have already used similar wording. Second, if he writes “Apple Inc.”, the application suggests, for example, “Apple has launched the iPhone Pro Max in 2020, which gave them xxx hundred thousand $ annual revenue”. Manager X can now save a lot of time and energy otherwise spent searching Google for facts and figures. Assistance in paraphrasing keywords could likewise improve business writing greatly. This would also require a lot of front-end development, but the black-box part would be the NLG described here.

Practical implication

The study has great potential for real-world practical use: for example, next-word prediction, topic modeling to extract text out of scanned images, checking the contextual soundness of business writing, and the suitability of word usage even if it is grammatically correct in the first place. Subject-specific knowledge, language usage, and vocabulary always differ from generic language. Many companies and start-ups have software applications that use a similar approach but on general language text. Here is a list of some: Gmail autofills salutations and common words during email writing [20]; Grammarly [21] gives word-context suggestions and content clarity based on the text it has been trained on, and at registration it asks for the purpose of use, so something like a Grammarly business writer could be a very practical use of this study; Reverso Translator gives translations based on the frequency of usage of the word in the literature, along with example text in which the looked-up words have been used. There is potential for a similar tool that gives accurate context for business-related text only. Lastly, we did not know it at the time of conducting this research, but one online platform has now emerged that uses an augmented writing approach with great success and has top-level firms in its customer portfolio, see [54]. This would be a very practical usage of such a study. Language generation models do not only have business-related applications but have been applied to many fields; e.g., van Deursen [15] introduced Generative Examination Networks (GEN) to generate chemical space [5].

How does deep learning integrate into the corporate sector?

The literature on natural language processing is rooted back in the 1940s. After parsing the literature, the evolution of NLP can be segregated into different phases: the journey started with machine translation problems, followed by the computer and information technology revolution, which triggered AI applications in this area. After AI and machine learning came into the picture, the ability to solve complex tasks in less time improved, and grammatical structure received more focus. After advancements like deep learning and reinforcement learning, NLP has now entered artificial text generation, and the generated text is hardly distinguishable from human writing.
Though the research community of that time was already working on NLP, the first scientific papers were published by the head of the MIT language department, William N. Locke, and A. Donald Booth, head of Birkbeck College [28]. Machine translation (MT) started with the three dominant languages of that time, English, Russian, and a bit of Chinese. Computational resources were too scarce and much effort had to be exerted on converting data into bits [1]. Early birds in this area focused on syntactical computational processing of language, and it was important to first draw the basic structure of the language [35]. In the work of [11], some researchers tried to shift the focus from syntactical to semantics-oriented language processing. Ceccato attempted correlational analysis between the same patterns in a pair of languages and tried to achieve semantics-driven language processing. Winograd [52] and Woods [53] saw the transformational grammar theory of the 1960s as a misfit for computational grammar and analysis, offering little in terms of semantics. The computational confidence approach given by Woods and Winograd enriched the previous work along a semantic path.
Later, in the 1980s, AI came into the picture and the community shifted its focus toward machine learning based approaches for solving the existing dilemmas of NLP in a purely semantic way [41]. In this decade, researchers realized that NLP tasks such as building word representations for AI-related networks and pinning down context are very hard. Some notable work of the 1980s is as follows: Briscoe et al. [9] built a general-purpose grammatical formalism, including a syntactical analyzer for the English language, with the help of software named the Grammar Development Environment (GED). They also programmed software to build and manage a large grammar base. In the direction of speech recognition, Young et al. [57] led major US speech recognition projects, called continuous speech recognition (CSR) and large vocabulary continuous speech recognition (LVCSR). The paper includes tools and methods for news transcription, text dictation, and transcriptions.
The next phase of NLP development was the 1990s, which mostly focused on a combination of lexical and syntactical approaches to natural language processing. After many twists and struggles over almost two decades, statistical and probabilistic approaches were adopted for classification tasks in NLP [43]. Later, these models became the raw sources of machine learning techniques for solving NLP complexities. For example, Manning and Schuetze [29] worked on information retrieval, feature extraction, and analyzing textual information with statistical models. Mani and Maybury [30] used terminological logic to build a knowledge base for automatic information extraction and text summarization. By the end of the 1990s, dialogue speech systems and language processing had expanded the horizon with multilingual machine translation and speaker-independent speech-to-speech dialogue systems. Wahlster [50] worked on the Foundations of Speech-to-Speech Translation project, the so-called ‘Verbmobil’. This multilingual system (German, English, and Japanese) takes input in a speaker-independent manner and translates it into the other desired languages. It also handles domain-specific spoken business dialogues and translates them into other languages with approximately 80 percent accuracy. The struggle of many years made NLP researchers, practitioners, and industry realize that linguistic resources are indispensable for further development in this field; thus two resources, the “British National Corpus” [8] and “WordNet” [17], came into being. The next era of natural language processing started after 2001. Though many models other than neural networks have been proposed by researchers, we discuss only the important neural network oriented models in this paper.
Bengio et al. [7] proposed a state-of-the-art tri-gram neural probabilistic model. They used a neural network for the probability function. The idea is based on the conjecture that unseen words get a higher probability of being predicted based on their similarity to the words on which the network was trained. The next-word prediction approach has many practical commercial uses; for example, see the work of [26], which can generate a short semantic reply to an email.
The next advancement in the field of NLP was multitask learning; of course, this method is not confined to NLP but is a general enhancement in the neural network world. Collobert and Weston [12] tried to implement this technique for transfer learning. Vector representations of the words were fed as input to the model to do word prediction, and then the learning of the current model was transferred to another independent model to achieve a similar but not identical task. The multi-task learning approach was first introduced by Caruana [10]. Once the so-called word vector representations are fed to the neural network, it starts learning the context and association of each word with the others. Transfer learning makes it possible to share the learned weights across models for generalization and an incremental learning approach. During the optimization process, it is very important which parameters to transfer. Ruder [39] proposed that the sharing parameters can also be learned during the learning process; see also similar research [31]. In this connection, the next milestone was “vector representations” of text, so-called word embeddings. This basic word embedding idea was first floated by Mikolov [33]. They proposed that removing the hidden layer while training the word embeddings gives more promising outcomes. Later, this idea paved the way for the concept ‘word2vec’, originally adapted into two popular approaches, namely bag-of-words and skip-grams. This phenomenon triggered research interest in this direction and many researchers have enriched the concept, see [2, 3, 34, 51]. The current direction of word embedding is to train on a very large corpus and use the pre-trained embeddings for multilingual models in an independent and unsupervised fashion; for example, see [4, 13, 42].
In the years 2013 and 2014, neural network architectures began to be applied to NLP; the most obvious choices were recurrent, recursive, and convolutional neural networks. Simpler Elman RNNs [16] were replaced with LSTMs by [23] because of the long-term context dependencies in input text. Secondly, convolutional networks originally dealt with computer vision but have also been implemented for NLP; for example, see the work of [25, 27]. The obvious plus of using convolutional networks is that they are more parallel and rely on local context within layers rather than on the past state, contrary to LSTMs.
Concerning recurrent neural networks, the next enhancement was sequence to sequence modeling (seq2seq). The seq2seq model uses the same recurrent architecture of neural networks, but the important bit lies in the encoding and decoding procedures. The input sentence is first encoded into a vector representation. The decoder then tries to decode the predicted symbols sequentially, based on the encoder state. The sequence to sequence model was proposed by Sutskever et al. [44]. Later, in 2016, Google [19] decided to change its monolithic sentence-based machine translation to a completely neural network based system. Now, seq2seq models are the foundation of language generation models and further developments, i.e. transformer-based neural network architectures. Similarly, image captioning [48] uses the same technique to generate image captions automatically. The seq2seq model led toward attention mechanisms and transformer-based approaches. The basic limitation of the seq2seq network is that it tries to compress the whole sequence of the sentence into a fixed-length vector; thus, the model cannot look into the hidden state. The attention mechanism, by contrast, looks into the hidden states of the model and combines them to determine how much stress should be given to a specific word. Attention [6] was the core innovation in the field of neural machine translation that permanently replaced traditional methods of machine translation. For different flavors of attention based networks and their applications, see reading comprehension [22], entity parsing [49], and image captioning [55].
Pretrained models have gained popularity among the NLP research community. The main advantage of a pretrained model is that it is context agnostic and unsupervised. Labeling for NLP tasks can be very costly and challenging. The pretrained model captures the meaning and context of one language, and this learning can be transferred to another language to obtain meaning, context generation, or translation. The pretrained model was first proposed by Dai and Le [14]. The current study is also based on a pretrained multi-head attention based model.

Methodology

In this section we describe in detail how the data is preprocessed and how the processed data is fed to the model. The fully preprocessed data will be made available as open-source data for further research and development.

Data preprocessing

In this section, we describe the process of data preparation for model training. Everything else with respect to the neural network model is similar to many other applications of ANNS, but the main concept here is to leverage the training process with an enormous amount of training data. Websites can be a potential source of a lot of textual data with a great deal of diversity in it, but the bottleneck with website data is its validity and the large amount of unnecessary information it contains. Following the research by Vaswani [47], we adopted a similar approach and chose ‘Reddit’ [37], a US-based social news aggregation and discussion platform with 330 million users [37], to collect the website URLs from which to parse the data. To ensure the validity and usefulness of the web URLs, only those links were taken that had more than 3 ‘karma’. ‘Karma’ is a so-called assurance given by other users about the validity of comments and discussion. In this way, we got a human-level quality check on the data. Once we had devised the data quality mechanism, the next filter was to keep only URLs related to business and the Fortune 500 companies. Most of the top 500 companies have their discussion and news profile on ‘Reddit’, called a ‘Subreddit’. ‘Reddit’ has a very large community and thus thousands of submissions are committed on a daily basis. The raw data, ranging from 2005 to 2017, was first programmatically collected with the help of the ‘Reddit’ programming interface [38] and stored in a ‘BigQuery’ database. In the next step, we extracted all URLs with a ‘karma’ ranking of more than 3 from the daily submissions of the users. These URLs were verified as to whether they were working or not, and in the end a list of 1,852,482 working URLs was prepared for parsing the textual data from ‘Hypertext Markup Language (HTML)’ tags. With the help of parallel computing and a computer grid, 20 GB of text files were collected from all working URLs. These 20 GB of text files were again filtered for unnecessary characters and symbols. Finally, 2,302,554,291 text tokens were collected to be converted into word embeddings. The process is shown in Fig. 1a, which depicts the flow of data preprocessing with the help of a schematic diagram.
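A minimal sketch of this filtering and extraction step is given below, assuming the submissions have already been exported (for example from BigQuery) as JSON lines with at least a url and a score (karma) field; all file names and function names are illustrative stand-ins rather than the exact pipeline used in the study.

```python
# Sketch of the URL-filtering and text-extraction step (illustrative only).
import json
import re

import requests
from bs4 import BeautifulSoup

KARMA_THRESHOLD = 3  # only keep links the community has upvoted

def collect_urls(submissions_path):
    """Yield unique URLs from submissions whose karma exceeds the threshold."""
    seen = set()
    with open(submissions_path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            url = record.get("url", "")
            if record.get("score", 0) > KARMA_THRESHOLD and url.startswith("http"):
                if url not in seen:
                    seen.add(url)
                    yield url

def is_working(url, timeout=5):
    """Cheap liveness check before spending time on a full download."""
    try:
        return requests.head(url, timeout=timeout, allow_redirects=True).ok
    except requests.RequestException:
        return False

def extract_text(url, timeout=10):
    """Download a page and strip HTML tags, scripts, and extra whitespace."""
    html = requests.get(url, timeout=timeout).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    for url in collect_urls("submissions.jsonl"):   # hypothetical export file
        if is_working(url):
            print(extract_text(url)[:200])          # preview the first 200 characters
```

In the actual pipeline the same steps were distributed over a computer grid, and the cleaned text was subsequently tokenized into the 2.3 billion tokens described above.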

Methods

Next comes the transformer neural network model applied to the preprocessed data. The transformer model takes all word tokens encoded into word embeddings, which are simply the numbers that represent each word. Normally, transformers have two parts, encoders and decoders, but we have only used the decoder part of the transformer, because the combination of encoder and decoder is suited to machine translation, which is not the case in this study. See Fig. 2 for how the general transformer, originally designed for machine translation problems, works. This architecture was later adopted and modified by many researchers and labs to improve NLP and translation related problems. If you pay closer attention to the paper [48], you will realize that transformers are basically also a form of transfer learning: sentences of language one pass through many layers of self-attention and feed-forward neural network layers that update the training weights while keeping in mind the relationship of each word within the sentence and the position of each word, and the learned weights of language one are transferred to the feed-forward layer of the decoder part, which takes the relational, positional, and grammatical aspects into account when the model tries to predict the words in the second language. That is how the essence and context of a sentence are translated correctly. Our case is different from machine translation, so second-language input weights are not applicable here; we therefore stick to the decoder part of the model as the main architecture. Coming back to data processing, the word embeddings are stored and converted into NumPy zip format for simplicity. First, we look at the high-level representation of the model, and then at how the self-attention layer works. The model gets the word embeddings as input and assigns a positional encoding to each word. The positional encoding keeps track of the position of the word in the sentence, to capture the context efficiently, contrary to a random order. The word embeddings, along with their positional information, pass through the self-attention layer. The self-attention layer is twelvefold; that is, it has twelve heads.
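The sketch below illustrates how position information can be added to word embeddings, using the sinusoidal encoding of the original transformer paper [47]; GPT-style models may instead learn their positional embeddings, so this is a conceptual illustration under that assumption rather than the exact layer of our model.

```python
# Conceptual sketch: sinusoidal positional encoding added to word embeddings.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position codes."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions use cosine
    return encoding

# Toy input: 10 tokens embedded in 512 dimensions, position added element-wise.
embeddings = np.random.randn(10, 512)
model_input = embeddings + positional_encoding(10, 512)
```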
By way of analogy, we can say that this layer creates many copies of the sentence and maps the relationship and importance of each word in the sentence to figure out how much attention should be given to a specific word. That is why it is called a multi-head self-attention layer. We can now look inside the self-attention layer to see how it works. Each input vector \({\mathbf {X}}_{1}.. {\mathbf {X}}_{N}\) is multiplied by three different weight matrices, \({\mathbf {W}}^{Q},{\mathbf {W}}^{K},{\mathbf {W}}^{V}\), randomly initialized with dimension 64, and the outputs of these multiplications are the query vector (\({\mathbf {q}}_{1}\)), the key vector (\({\mathbf {K}}_{1}\)), and the value vector (\({\mathbf {V}}_{1}\)). In the next step, we take the dot products (\({\mathbf {q}}_{1} \cdot {\mathbf {K}}_{1}....{\mathbf {K}}_{N}\)) for a sentence of words (1....n). To stabilize the gradient process, each output is then divided by \(\sqrt{d_k}\), where \(d_k\) is the dimension of the key vector. This operation gives us a score for each word; the higher the score, the more attention should be given to that word. In the next step, all the scores for one word with respect to all other words are passed through a softmax and accumulated into a variable \({\mathbf {Z}}\):
$$\begin{aligned} {\mathbf {Z}} = softmax\left( \frac{{\mathbf {Q}}\times {\mathbf {K}}^{T}}{\sqrt{d_{k}}}\right) \times {\mathbf {V}} \end{aligned}$$
(1)
This is the final calculation of one out of many self-attention heads, which is fed, in matrix form, to the feed-forward neural network. To focus on different positions of the words in the sentence, we need multiple representational subspaces; these subspaces are achieved with the help of multiple heads, or copies, of the attention layer, so:
$$\begin{aligned}&{\mathbf {Q}}_{i}....{\mathbf {Q}}_{n}= {\mathbf {W}}_{i}^{Q}{\mathbf {X}}\\&{\mathbf {K}}_{i}....{\mathbf {K}}_{n}= {\mathbf {W}}_{i}^{K}{\mathbf {X}}\\&{\mathbf {V}}_{i}....{\mathbf {V}}_{n}= {\mathbf {W}}_{i}^{V}{\mathbf {X}} \end{aligned}$$
(2)
where i...n indexes the attention heads, \({\mathbf {Q}},{\mathbf {K}}, {\mathbf {V}}\) are the query, key, and value matrices, and \({\mathbf {X}}\) is the word embedding input matrix. Every attention head produces a \({\mathbf {Z}}\) matrix, one for each head chosen, in our case 12. The attention output matrices \({\mathbf {Z}}_{1}....{\mathbf {Z}}_{12}\) are jointly multiplied with a weight matrix \({\mathbf {W}}_{O}\). The resulting matrix is the input for a fully connected feed-forward network. The final output of the feed-forward network is then decoded back into words to generate the sequence of the sentence. For clarity on the dimensions of the different matrices, please refer to Table 1.
Table 1
The table gives the dimensions of the different matrices

Matrix | Dimension
\(\hbox {X}_1....X_n\) | Up to 512, depending on the length of the sentence
Every W | 64
\(X \times W\) | \(D_X \times 64\)
Z | \(D_X \times 64\)
\(\hbox {W}_o\) | \(D_X \times 64\)
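The following NumPy sketch ties Eq. (1), Eq. (2), and Table 1 together: each head projects the input \({\mathbf {X}}\) with its own weight matrices, applies scaled dot-product attention, and the concatenated head outputs are jointly projected by \({\mathbf {W}}_{O}\) before the feed-forward network. The head count (12) and \(d_k = 64\) follow the description above; the random weight matrices are stand-ins for the learned parameters, so treat this as an illustration rather than the study's implementation.

```python
# NumPy sketch of scaled dot-product attention (Eq. 1) and the multi-head
# combination (Eq. 2); weights are random stand-ins for learned parameters.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V  -- Eq. (1)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, num_heads=12, d_k=64):
    """Run several attention heads on X and project them back jointly (W_O)."""
    seq_len, d_model = X.shape
    heads = []
    for _ in range(num_heads):
        W_q = np.random.randn(d_model, d_k) * 0.02   # stand-ins for W^Q, W^K, W^V
        W_k = np.random.randn(d_model, d_k) * 0.02
        W_v = np.random.randn(d_model, d_k) * 0.02
        heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
    Z = np.concatenate(heads, axis=-1)               # (seq_len, num_heads * d_k)
    W_o = np.random.randn(num_heads * d_k, d_model) * 0.02
    return Z @ W_o                                    # input to the feed-forward layer

X = np.random.randn(8, 768)                           # 8 tokens, 768-dim embeddings
print(multi_head_attention(X).shape)                  # (8, 768)
```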

Results

In this section, we present the results of our study, namely text samples generated by our trained model. The results include samples from both conditional and unconditional sampling. Conditional sampling means that we provided a certain keyword to the model as input and the model returned a text paragraph related to that keyword; unconditional means random samples generated by the trained model. The ‘Tensorboard’ training loss summary of the model is given in the Appendix. To support the accuracy of the model and to show that the samples do not appear by chance, we have given 100 randomly generated samples in the Appendix.
We trained the model for up to 460,000 steps. Since the model has almost 355 million parameters and more than 2.3 billion text tokens, it requires extremely high computational power and time. The model was trained for 3 months on a single GPU and settled at a loss value of 2.6. This loss value is quite reasonable for a text-based model, because language models always involve complex grammatical chains, dependencies, and structures that are not easy to capture. The next two subsections provide model generated text, based on conditional and unconditional random outputs respectively.
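For illustration, the difference between the two sampling modes can be sketched as follows; model_step and tokenizer are placeholders standing in for the trained network and its vocabulary, so this is a conceptual outline of the sampling loop rather than the actual script used to produce the tables below.

```python
# Conceptual sketch of conditional vs. unconditional sampling from a trained
# language model; `model_step` and `tokenizer` are hypothetical placeholders.
import numpy as np

def sample_next_token(probs, top_k=40, temperature=0.7):
    """Pick the next token from the top-k most likely candidates.
    Temperature sharpening is applied directly to probabilities for simplicity."""
    probs = np.power(probs, 1.0 / temperature)
    top = np.argsort(probs)[-top_k:]
    p = probs[top] / probs[top].sum()
    return int(np.random.choice(top, p=p))

def generate(model_step, tokenizer, prompt="", max_tokens=200):
    """Conditional if `prompt` is given, unconditional (random) otherwise."""
    tokens = tokenizer.encode(prompt) if prompt else [tokenizer.bos_token_id]
    for _ in range(max_tokens):
        probs = model_step(tokens)          # next-token probability distribution
        tokens.append(sample_next_token(probs))
    return tokenizer.decode(tokens)

# generate(model_step, tokenizer, prompt="Apple iPhone")   # conditional sample
# generate(model_step, tokenizer)                          # unconditional sample
```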

Interactive conditional outputs of the model

This subsection provides 5 different output samples from the interactive conditional sampling method of the study model. These are so-called interactive model outputs, in which the model communicates with the user. The user gives input keywords to the model and the model generates a text paragraph that mostly talks about the given keyword or topic. Tables 2, 3, 4, 5 and 6 show the output against five different user-given inputs.
Table 2
Results of interactive conditional samples
Key
Conditional output
Apple iPhone
7 Plus: 3-month free trial 15% off $150+ purchase iPhone 7 Plus: 3-month free trial 15% off $150+ purchase 4G LTE: iPhone 6s: 2 year data plan 3 GB data, $49.99/mo 8GB data, $59.99/mo 4G LTE: iPhone 6s: 2 year data plan 3 GB data, $49.99/mo 8GB data, $59.99/mo iPhone 5s: 2 year data plan 5 GB data, $45.99/mo 8GB data, $69.99/mo
Automatic output generated by the trained model against user given topic/key
Table 3
Results of interactive conditional samples
Key
Conditional output
Oil &price
“which make up the global oil industry. The used state-of-the-art computer models to show that the price of crude from North Dakota has increased by a factor of 40 over the past year. In the past year, it has risen by 28 percent, a period of unprecedented energy pricing, according to the report.That is an almost unheard-of percentage increase, said Tim Leissner, an assistant professor at the University of Michigan who specializes in energy and economic modeling, which is used in energy forecasts. “Usually, the percentage increases are pretty much the same year after year, so that’s a pretty consistent pattern,” said Leissner, who was not involved in the report. “Normally what we would expect is that the price goes up because of oil production, and it generally goes up at the end of the year.”But that is no longer the case, said Leissner.“The oil industry has a new president in Trump, and his agenda is very aggressive on a range of issues. And the industry is not only seeing this increase, but it is actually seeing it drop,” Leissner said. The increase in North Dakota oil production has gone from 6.5 million barrels per day in the first half of 2017 to 7.3 million in the second half of 2017 and 7.3 million in the first quarter of 2018, according to the report. And in the last two months alone, North Dakota’s production has fallen from 4.6 million barrels per day in the first half of 2018 to 3.3 million in the second half of 2018. The increase in oil production in North Dakota has also affected a number of other energy prices, with natural gas prices rising about 12 percent in the first half of 2017 and natural-gas prices rising about 13 percent in the first half of 2018.The data comes as the Trump administration prepares to review the Obama rule that requires oilfields on federal lands to submit to federally supported research and exploration programs. The rule will affect oil and gas development in the state. Leissner said the oil market will be monitored for “the next few years” for the first time in decades. “This kind of increases is a big deal, given that oil is the only major product in the world that people are going to be producing for years to come,” he said.The research is published in the Journal of Energy Technology. The report, which included”
Automatic output generated by the trained model against user given topic/key
Table 4
Results of interactive conditional samples
Key
Conditional output
Microsoft Windows
“users, if you don’t already have them or you’re thinking about installing them on your Windows Vista/Windows 7 PC, you’ve got to pay the price. There’s also the issue that you’re going to have to run Microsoft on your current PC, which isn’t ideal if you’re a Mac user. If you only do this once, and you’ve got Mac OS X installed, Windows 7 will work. The problem with this is, it’s a pain to use on those platforms. Even if you’ve got Mac OS X on your computer, it’s going to be a pain to use in Windows, and if you want to use Windows Vista on a Mac, it’s going to be a pain on Windows. So we’re going to need to figure out a way to get Windows 7 running on existing Macs. We’ve been trying to do so, so far. And at some point, we’ve found something that’s going to let us do it. We’ve been working with the folks over at Xiph.ai and are able to install and upgrade and upgrade, and I don’t want to go into too much detail, but we actually managed to install Windows 7 on Mac OS X through the Xfce application. And if you do this just once and you’re running Windows 7 on a Mac, it will run. Xfce and Window 7 To get Xfce running for Windows, right-click in the Xfce launcher, and then choose “Xfce Installer”. This will open a pop-up window in which you can choose what to install. The first option is “Mac OS X 10.4.5”. You can choose the installer from the list below. If you’re using the installer option, you’ll notice that Xfce has been created for you. And you’ll also see an “Xfce.app” folder for Windows. In order to get Xfce running on Windows, we’ll need to run the installer from that folder. Open Xfce by going to its application menu. Select “Install Xfce” from the search box. You’ll notice it’s not”
Automatic output generated by the trained model against user given topic/key
Table 5
Results of interactive conditional samples
Key
Conditional output
S &P 500
“index fell 1% and the S &P 500 and Dow Jones Industrial Average gained 0.7% and 0.4%.The Dow closed up 8.2%.Meanwhile, the S &P 400 surged 3.8%, with the Nasdaq Composite index up 8.3%. The Nasdaq rose 0.4%, while the S &P 500 3.6%.While Wall Street was off the back foot, on the financial markets, the Dow rose 1.8% and the Nasdaq rose 0.8%.In the energy market, the Standard & Poor’s 500 index dipped 1.2%, the Nasdaq index rose 3.6% and the S &P lost 4.1%. In real estate, the S &P 500 rose 0.8% and the Nasdaq climbed 1.2%.Meanwhile, the Nasdaq’s gains were smaller in the oil energy group, while it gained 0.4% and the S &P 500 rose 0.3% in residential and business.The Nasdaq rose 2.0% and the S &P 500 gained 0.8% in the energy sector.At the top of the industry, the S &P 500 closed 6.9%, with the Nasdaq up 2.6% and the S &P 500 up 2.1%.In the energy group, the Dow rose 8.7%, with the Nasdaq rising 2.4% and theS &P500 adding 4.5% in the energy segment.And in residential, the S &P 500 rose 1.7% and the Nasdaq rose 1.3%.In the business sector, the S &P 500 rose 2.5% and the Nasdaq climbed 1.1%.The Dow was up about 7.8% as the S &P 500 fell 1.4%, while the Nasdaq declined 1.2%.The Dow hit a record high of 11,868. The S &P 500 was up 11.4% and the Nasdaq rose 0.4%.The financial sector was trading at a record high as investors worried about the economy and a potential Federal Reserve interest rate hike, which has taken hold following a sharp decline in the central bank’s benchmark interest rate.The Dow closed up 8.7%, the S &P 500 rose 3.8%, and the Nasdaq rose 3.9%”
Automatic output generated by the trained model against user given topic/key
Table 6
Results of interactive conditional samples
Key
Conditional output
Heath care industry
“in a manner that protects the privacy of patients. We will continue to take this necessary step to ensure that our industry is transparent and accountable,” the department said in its statement. In addition, the department plans to “deregulate the sale of any health care services without a patient consent,” and “work towards an industry-wide ban on involuntary sex trafficking. The Department of Justice has been involved throughout the issue of sex workers and the use of technology, and is actively working with regulators and private-sector partners to better address the issue of human trafficking, the company stated. ”With recent advancements in technology for technology’s sake, sexual exploitation is no longer confined to the virtual realm, but now has become a reality,“ it added. The department said it ”will continue to work with state, federal and private sector partners to improve their policies and laws regarding human trafficking, including protections and the use of technology to protect victims of sex trafficking.” The government’s goal is to help ensure that “we do not, in any way, further endanger any of our citizens, particularly those whom we have entrusted with our law enforcement and public safety responsibilities.” The department added, however, that it “will not tolerate law enforcement personnel, employers, contractors and contractors to be complicit or assist in unlawful human trafficking, as that might undermine our mission.”\(<|\)endoftext\(|>\)“This domain is no longer available, possibly due to a change in the registrar or the domain administrator.”\(<|\)endoftext\(|>\)“What started as a series of tests to see if it could survive the cold, the hot and the humidity is turning into a real life experiment that will test the idea that this tiny space heater might have a practical purpose. For this project, the team set out to conduct the tests using just a standard fridge - just a standard fridge, for short. But they wanted to see what it could do if the temperature inside it dropped by 10 °C—a scenario that will be a reality in 2017. The first test will bring in an air temperature of – 30 °C—something which will kill any kind of bacteria as well as fungi in about two days. As you can imagine, that wasn’t the worst of the worst in the test. A typical fridge—even the kind you’ll find in the kitchen—was just as bad as the prototype at catching bugs. The team has now developed a small freezer for the freezer, which can withstand even lower”
Automatic output generated by the trained model against user given topic/key

Unconditional outputs of the model

The section below gives Tables 7, 8, 9, and 10, which show random sample outputs generated by the model. This is artificial text written by the model. If we observe the generated paragraphs, it is very clear that the text mostly follows grammatical rules and that the topics of the samples point towards business-related text. An enormous number of samples can be produced on demand; for the brevity of this article we have only given some samples.
Table 7
Results of non-conditional samples
Sample
Machine generated text
Sample1:Topic Microsoft
SA-4, but some of the other models that it supports can be set, as can the version of Windows. However, most users will need to install the free and limited OS X Lion operating system. To get it the free version, you can purchase it from the Apple site. In the past few years the Windows phone market has grown significantly. Microsoft has been aggressively supporting Microsoft Office applications in Windows 8. The company is still selling a wide selection of Office applications. The reason Microsoft has decided to target the Android market, especially is that, as a company that is heavily invested in smartphone users, Microsoft will be less apt to change direction. Still, it is unlikely that you’re going to be surprised to find that Windows Phone has managed to outcompete Android in terms of support level and functionality. That has been the case for years now, and not just because of the various operating systems. However, things have changed for the better. The development time has gotten slower, and the hardware has gotten more modern. There have also been increased efforts to make the operating system itself more user friendly. Microsoft has been steadily working to increase the range of features available on the platform. In the case of Windows Phone 8, this means that it supports the latest version of the operating system, the Universal Windows Platform (UWP), which has evolved in ways that made the system more accessible to the new users. The new version of the Windows Phone OS also integrates with the new Metro UI, which has been available for several years already. This means that the interface is easier to navigate with each new update. There is an obvious difference between the Windows Phone version of the Windows OS and other systems, which makes the difference in the Windows Phone OS much stronger. As a result, the Windows Phone OS is likely to enjoy a much wider appeal. The Windows Phone OS is a much more mature OS, however, and it may prove to be even more attractive once it is officially supported by Microsoft. In that case, it seems that the Windows Phone OS can only prosper as long as Redmond will provide more hardware devices that can be run this OS. It is, of course, a very hard problem to solve. However, it seems inevitable that this issue will play a greater role in Microsoft’s future strategy. Microsoft has been focused on offering a wide range of popular consumer and enterprise computing options in order to take advantage of the growing mobile market. There is also a good chance that the introduction of Windows and Office to the marketplace will bring a greater opportunity for Windows to become a mainstay for mobile devices. Further Reading\(<|\)endoftext\(|>\)A new video showing a drone carrying a baby to her birth, while also showing her running and jumping, is set to debut at the London premiere of David Cronenberg’s “Puff Daddy,” at the V &A on Wednesday. Puff Daddy follows a teenage girl whose father is killed in an accident and has been left with an orphaned daughter. Her ex is a young woman from a nearby suburb who has a passion for flying and is looking for a way to give back to her community. “You can’t have a child without a parents,” said Cronenberg, who directed “All the Money in the World” and “O Brother, Where Art Thou?” alongside Brian De Palma and John Frankenheimer, as well as “Shakespeare’s Son,” “All the Money in the World” and “The Other People’s”’ alongside Tom McCarthy. 
The director also showed off new CGI footage of the film’s main characters, including the first scene where they’re shown playing with the baby and flying. “Puff Daddy” is shot in a sequence that features a close-up of the girl and the baby. “We wanted to create a visual effect in ways that were visually appealing,” Cronenberg said. The scene with the daughter flying was filmed in the streets of North London, but Cronenberg said the scene in the film’s last shooting, “Rudolph the Red-Nosed Reindeer,” will also appear in the visual effects package. The project also features a “flying baby” sequence that was filmed in a nearby suburb
Automatic output generated by the trained model
Table 8
Results of non-conditional samples
Sample
Machine generated text
Sample2:Topic: Health Care
,” said the chief executive officer of the British medical charity, Beaumont Hospital. “In recent years we have seen a surge in the number of young people coming into this country seeking to change our society, but the risks of those who do become radicalised remain too high. ”Young people like Mika are at risk of radicalisation and may be vulnerable to becoming radicalised themselves through viewing social media as a possible means of radicalising themselves or others. “Our advice is to work closely with the police and other relevant authorities to help these young individuals to understand the risks in the communities they may come into contact with in the future and to talk to parents about their responsibilities.” It has long been feared that social media is inextricably linked to radicalisation. Earlier this year it emerged that the police were monitoring 4 million posts on Twitter, Facebook, Kik and Line, all forms of instant communication, for signs of terrorism. But Dr. John Ralston of the University of Oxford has claimed that although “social media has been used for a long time in the UK, and indeed throughout Europe, some parts of society have never noticed it.” He said that while people in certain sections of the community have been concerned at the recent rise in extremism, many young people in other parts of the population have not. “The vast majority of young people at one time or another have encountered such people through social media.”\(<|\)endoftext\(|>\)The biggest financial institutions have the greatest exposure to the market, yet they are the most transparent, according to Transparency Market Research (TMR). The research group analyzed more than 700 leading financial companies, looking for those who reported some form of transparency, or disclosed more information than allowed. The findings, based on the organization’s annual survey of 1250 U.S. companies, show that firms with the largest exposure to the financial market have disclosed the highest amounts of transparency - even though they are less transparent than the average firm. Those firms with the most transparency are also the companies that the study found to the highest risk of market abuse, including: \(\bullet\) A number of the top 50 firms made disclosures in excess of 30 percent of their company size. \(\bullet\) The majority of firms did not disclose their disclosure forms during the year
Automatic output generated by the trained model
Table 9
Results of non-conditional samples
Sample
Machine generated text
Sample3:Topic Energy Market
to have more time with people,” he says. “I am sure I will enjoy sitting on my porch with the trees behind me. It will be a really relaxing time.”For all the progress the film has made this year, however, as the box office has surged, so has that of the franchise. The latest installment, starring Tom Hiddleston, will bow on Monday, while the two-part drama “Hiddleston: Longmire” will arrive in U.S. cinemas on February 19, followed by its worldwide debut in April.\(<|\)endoftext\(|>\)Hodl has been selling the technology for years. On the shelf of any Walmart, Walmart.com or Amazon, it’s not uncommon to come across a shelf full. But now a new technology to turn your clothes into a new energy-generating asset comes to San Francisco and Silicon Valley. As the world warms and urban temperatures rise, the amount of energy stored in the fabric of the clothes and other products increases. This process has the potential to revolutionize a whole range of industries and technologies-as well as the way people buy and use clothes. Firms that design, produce, market, sell or install this technology face a number of problems, including the technology’s limitations-which includes the inability to make use of the wind. And in a major market, technology can be confusing to customers. If an installer sees a new technology on a product, its not clear where to go. Is it a product or an energy-generating product? It’s an industry with a number of big names, all of which are focused on the same thing, but each company has a different brand. Sprinting out of the closet Fifty years ago, it was a relatively new idea that was made possible at the beginning of the Cold War by advances in the nuclear power and hydrogen bomb. This meant that we could get more power with less fuel than any previous technology had before. But more power, as it turns out, is less efficient than it used to be, and that has led to more problems than it is solving. The technology was called the “energy storage.” The technology used energy (or energy storage) to cool a part of fabric. It was the first technology for storing energy. Then this technology fell out of fashion. In the mid-1970s, it was popularized by Motorola, a firm that produced smart refrigerators that could store energy and heat the refrigerator. This technology became the fuel cell and energy storage industry before it fell out of favor.Sprinting out of the closet The reason we still have to solve problems with energy storage is partly due to the sheer size of the industry. It takes lots of expertise to be able to make use of the technology, but we still find ourselves getting it wrong. The biggest reason are twofold. First, the technology was extremely effective early on, but as technology has made more efficient, then it has gotten difficult to get it right. Second, the technology is often confusing and difficult to integrate into existing products. If you’re in an industry where the problem with storing energy is a lack of innovation, you may find the technology confusing and cumbersome and you may have to learn how to use multiple energy storage products to get the right kind of energy. The companies behind the two companies-Gigasolar, in Palo Alto, and GigaSolar, in Sunnyvale-are trying out new energy storage technologies that make an energy store. Sprinting out of the closet is a common solution to those issues. However, there are companies that are able to solve the first problem using their previous technology without ever looking at the second problem. 
One company that has done this is Gigasolar, in Sunnyvale and Stanford. Gigasolar’s company mission is to get rid of the technology gap between what the customers need and what the companies are built for. It’s not a simple business. They have to design a product that solves the problem before it can be fully put to use. The company had to do this. Gigasolar, which is run by the founders Peter and Adam Zweig, are very involved in figuring out how to work within the energy delivery space. They have a number of patents that cover multiple products. In these patents, they have some of the best energy energy storage patents that you can find. For instance: They use carbon dioxide to store energy in the atmosphere and then they release carbon dioxide back into the atmosphere to convert it into new energy
Automatic output generated by the trained model
Table 10
Results of non-conditional samples
Sample
Machine generated text
Sample4:Topic Retailing
Production are not the same as production at a retail store like you are likely to see at the Superstore. I recently did a bit of testing for this.You Can’t Buy This There are two types of products that can make it through the Superstore checkout process. The first are the items that are sold using the same pricing structure as the retail store. For example, the price for the T-Shirt above only goes up $0.75 with the same color. All I had to do was go to the Superstore and double check my shirt price with the same information as I use to determine if my T-Shirt cost me half what it cost on the same item in the Retail Store. The T-Shirt above, when bought online, has been $0.75 up charge, but on sale is $0.85 for the same size of shirt (you can see why the $0.75 difference is so important!). The second type of product that can help you make it through the checkout process are items with the same price that you’ll find on the retail store. This is usually with items like apparel. For example, the same shirt is $.25 on the Superstore (a T-shirt cost me $0.85 on the Retail Store for the same size of Shirt), as long as they look the same they’re actually a bargain. If you’re a Clothing shopper and shop your clothing online, you’d probably get the price of those same garments for $0.75. In my opinion, the Superstore prices for the same items are more consistent, so I’ve decided to include them. To find the Superstore price for Clothing, you need to purchase a Clothing App item online, take the price as your price, then click the purchase button. You should then follow the steps above to determine the price for that item. A good starting point would be the same price that you pay for the Clothing App item in the Retail Store. Here’s an example of the Clothing App item that I would purchase on the Retail Store. The Shirt above, I bought for $27 online.If you purchase this on the Retail Store, $27 would be the price you’d pay online: $37.99. If you’d purchased it on the Superstore it would be $0.80. I would take my clothing price and multiply it by the retail value of the Shirt (i.e. $0.80 is what I think the T-shirt is on sale at the Superstore). I would assume this will be the same price as my T-Shirt. To do this, I just have to multiply the actual price by my price, then subtract my retail value from my Retail Store price to create a final price of $0.40 for the T-shirt. As you can see, my total Clothing App price is $38.50. To keep track When you’re shopping online, take in one value from each App item for those two different sizes of the Shirt (the same ones) then calculate that value and convert to a per-item price for the product you’re looking at while shopping. The final price should be that. So let’s say that we have the Shirt below for $25 that we think would be the same $25 price for the same Shirt on the Retail Store. After we calculated the per-item value for the Shirt, we would add $0.80 to our total clothing price for that T-Shirt to calculate our final price. In this case, that final price would be $0.80. This is the final value of that T-Shirt—$0.80
Automatic output generated by the trained model

Discussion

In this section, we discuss the results of the study and how they address the research problem. The main focus of this study is to test the validity and usability of current theoretical developments in natural language generation and, more broadly, in natural language processing. For data reliability, we used subreddits as a human-level quality check on the URLs. Robustness was enforced through a KARMA threshold of three points, chosen because three is roughly the average karma awarded to submissions in these subreddits; raising the threshold yields higher human-level quality but dramatically reduces the amount of data, which means losing a great deal of quality information and application context.
As we stressed earlier, the long dependency chains between words, the placement of a word within a sentence, and the relational space of words and characters are the main challenges in language generation. These problems were very difficult for recurrent neural network models to cope with, so researchers proposed new theoretical concepts. In this connection, our study provides the practicality, usability, and proof of concept of such a model. For this purpose we report two kinds of results: interactive conditional samples and non-interactive random samples. How the model was trained, the iterations, and the loss graphs can be found in the Appendix. The main objectives of the study were that the generated text stick to the overall topic (topic modelling), be grammatically correct, and remain related to business. Reading the results closely, we provided the model with diverse keywords drawn at random from different business genres; the model not only produced closely related text but also supplied facts and figures, and the linkage of sentences and the narrative flow are quite decent. This serves the hypothetical use case and the highlighted research gap. Additionally, for robustness, we generated thousands of random samples; for brevity only a few are reproduced here, and a cloud link to the remaining samples is provided. In both settings, interactive and non-interactive, the model achieves the initial goals of context, relatedness, and topic adherence. Of course, this is only a building block towards a meaningful commercial application; other pieces of the puzzle still need to be assembled, namely front-end development, scripting, and mapping generated word packs to counts and statistics over the whole database, with many more bumps and staggers along this road.
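The karma-based filtering described above is a simple thresholding step. The study's own preprocessing scripts are not reproduced in the paper; the minimal Python sketch below only illustrates the idea, assuming a newline-delimited JSON dump of subreddit submissions with hypothetical field names (score, url, selftext) and an illustrative input file name.

```python
import json
import re

URL_PATTERN = re.compile(r"https?://\S+")
KARMA_THRESHOLD = 3  # illustrative value mirroring the paper's three-point karma cutoff


def collect_urls(dump_path):
    """Yield URLs from a newline-delimited JSON dump of subreddit submissions,
    keeping only submissions whose karma (score) meets the threshold."""
    with open(dump_path, encoding="utf-8") as fh:
        for line in fh:
            post = json.loads(line)
            if post.get("score", 0) < KARMA_THRESHOLD:
                continue  # drop low-karma submissions
            if post.get("url"):
                yield post["url"]  # the linked page of a link post
            # any URLs embedded in the body of a self post
            yield from URL_PATTERN.findall(post.get("selftext", ""))


if __name__ == "__main__":
    urls = set(collect_urls("business_subreddit_submissions.jsonl"))  # hypothetical file
    print(f"collected {len(urls)} candidate URLs")
```

Deduplicating into a set, as above, is one simple way to arrive at a list of working candidate URLs before any human- or karma-level quality filtering is tightened further.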

Conclusion

The current study focuses on the application of natural language processing to business writing. In the recent past, the deep learning research community has come up with a new architectural style of deep-layered AI models that is aligned with the specific needs of natural language and text generation. The transformer is one of those models and has proven very accurate and effective at capturing context and grammar in text.
What, very briefly, is the purpose of the study? The study uses a generative pretrained neural network model fed with a large amount of business-related preprocessed text data, acquired by parsing the 1.8 million URLs collected from Reddit. With the trained model, a user can give keywords or a topic to the model, and the model produces a paragraph that sticks completely to the given topic, provided the topic or keyword lies in the domain of business or management sciences. These results can be used for automatic paragraph prediction to assist business report writers or anyone else involved in the writing process; many applications exist for next-word prediction, but paragraph prediction is still lacking.
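The paper's own training and sampling code is not shown here. Purely as an illustration of what interactive conditional sampling with a 355-million-parameter GPT-2-style model can look like, the sketch below uses the publicly available gpt2-medium checkpoint via the Hugging Face transformers library; the study's business-text checkpoint would be loaded in the same way but is not part of this snippet, and the prompt and sampling parameters are assumptions.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# gpt2-medium is the publicly available 355M-parameter checkpoint; the study's own
# business-text checkpoint would simply replace this model name.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# a hypothetical business topic given interactively by the user
prompt = "Retailing: price comparison between online and in-store purchases"
inputs = tokenizer(prompt, return_tensors="pt")

# top-k / nucleus sampling roughly mirrors an interactive conditional setting
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,
    top_k=40,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With a domain-adapted checkpoint, the decoded continuation is the kind of topic-conditioned paragraph shown in the result tables.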
Now, let us give a little more detail on how the data are preprocessed and the model is trained. A large amount of quality data is very important for language-processing models. To address the quality issue we chose Reddit, a news aggregation and content-sharing platform. Although Reddit covers many different topics, we shortlisted subreddits, i.e. topic-specific Reddit communities. There is a huge number of Reddit submissions every day, and KARMA votes are given to posts that are helpful to the community, so we collected 1.8 million URLs from submissions with a KARMA vote greater than three. In a separate step, we collected and cleaned all the text available at those URLs. In the end, 2.3 billion text tokens were fed to the model, which has 355 million parameters. After three months of training, the model can generate output text that is grammatically correct and aligned with a business topic. In the following subsections, we discuss possible practical applications of the model and suggestions for future work, along with some limitations of the study.
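The exact parsing and cleaning pipeline is not detailed in the paper. The following sketch shows one common way to pull visible text out of a working URL using requests and BeautifulSoup; it is an assumption offered for illustration, not the study's implementation.

```python
import requests
from bs4 import BeautifulSoup


def extract_text(url, timeout=10):
    """Fetch a page and return its visible text, stripped of markup,
    scripts, and style blocks; returns None on any network error."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    # drop non-content elements before extracting text
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    return text or None
```

Applied over all working URLs and followed by tokenization, a routine of this kind yields the raw text corpus from which the training tokens are derived.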

Implications and future work suggestion

There are many possible implications of this study. One possible use is market-intelligence report writing: a piece of software could be developed to auto-complete paragraphs for business-intelligence reports. Any business-related industry can benefit from paragraph prediction instead of mere word prediction, and in this way the speed and efficiency of the user can be enhanced significantly. As far as future suggestions are concerned, we think that prefixing the text tokens with the theme or topic of the text could make this model more useful; for example, during training we could state at the start of each text what that piece of text is about. In this way we would have greater control over the output of the model and could generate long reports in real time based on specific keywords. Reports are just one example; the model can be utilized in many more effective ways. Additionally, the study could be repeated with different KARMA-point thresholds to examine how the selection criterion affects the quality of the model's predictions. We hope the research community may already be working in that direction.
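As a rough illustration of the topic-prefixing idea suggested above, the snippet below prepends a topic tag to each training example. The tag format is invented for illustration only and is not part of the study.

```python
def prefix_with_topic(topic, text):
    """Prepend a topic tag so a model can condition on it at generation time.
    The <|topic|> ... <|endoftopic|> markers are illustrative, not the paper's format."""
    return f"<|topic|> {topic} <|endoftopic|> {text}"


example = prefix_with_topic("Retailing", "Production at a retail store differs from ...")
print(example)
```

At generation time, supplying only the tag and topic would then steer the model toward report-style text on that theme.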

Limitations of the study

We have tried to carry out the study to the best of our ability, but there are certain technical limitations. Models for text generation are usually based on an enormous amount of training data, which is a very important factor in capturing grammatical structure and topic relatedness, yet this study relies only on text extracted from the web URLs that were discussed in business-related subreddits. The study could be improved significantly with more sources of training data and more computational power.

Acknowledgements

No acknowledgements.

Declarations

Not applicable.
Not applicable.

Competing interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Figure 3 shows the training loss and Fig. 4 shows the test loss in the model-training summary graphs. The graphs were produced with TensorBoard, a tool for visualizing the hidden layers and internal mechanisms of ANN models, written by Google as a companion to the TensorFlow library [45].
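For readers unfamiliar with TensorBoard, the following minimal sketch shows how training and test loss curves of this kind are typically written as scalar summaries in TensorFlow 2; the log directory and tag names are illustrative and not the study's actual configuration.

```python
import tensorflow as tf

# hypothetical log directory; the study's actual layout is not documented
writer = tf.summary.create_file_writer("logs/gpt_business")


def log_losses(step, train_loss, test_loss):
    """Write one point of each curve; TensorBoard renders them as in Figs. 3 and 4."""
    with writer.as_default():
        tf.summary.scalar("train/loss", train_loss, step=step)
        tf.summary.scalar("test/loss", test_loss, step=step)

# The dashboard is then launched from a shell with: tensorboard --logdir logs
```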

Random samples

In this section, we provide a Microsoft OneDrive shared-folder link that contains 2284 samples generated by the model during the training process. A random sample was generated roughly every 200 training steps. The samples can be accessed via the following link:
References
1. ALPAC. Language and machines: computers in translation and linguistics. 1966.
2. Antoniak M, Mimno D. Evaluating the stability of embedding-based word similarities. Trans Assoc Comput Linguist. 2018;6:107–19.
3. Arora S, Li Y, Liang Y, Ma T, Risteski A. A latent variable model approach to PMI-based word embeddings. Trans Assoc Comput Linguist. 2016;4:385–99.
4. Artetxe M, Labaka G, Agirre E. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. 2018. arXiv preprint arXiv:1805.06297.
5. Bagal V, Aggarwal R, Vinod P, Priyakumar UD. MolGPT: molecular generation using a transformer-decoder model. J Chem Inf Model. 2021;62(9):2064–76.
6. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv preprint arXiv:1409.0473.
7. Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. J Mach Learn Res. 2003;3:1137–55.
9. Briscoe T, Grover C, Boguraev B, Carroll JA. A formalism and environment for the development of a large grammar of English. IJCAI, Citeseer. 1987;87:703–8.
10. Caruana R. Multitask learning. Autonomous agents and multi-agent systems. 1998.
11. Ceccato S. Correlational analysis and mechanical translation. 1967.
12. Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning. 2008. p. 160–167.
13. Conneau A, Lample G, Ranzato M, Denoyer L, Jégou H. Word translation without parallel data. 2017. arXiv preprint arXiv:1710.04087.
14. Dai AM, Le QV. Semi-supervised sequence learning. In: Advances in neural information processing systems. 2015. p. 3079–3087.
15. van Deursen R, Ertl P, Tetko IV, Godin G. GEN: highly efficient SMILES explorer using autodidactic generative examination networks. J Cheminform. 2020;12(1):1–14.
16.
17. Fellbaum C. Towards a representation of idioms in WordNet. In: Usage of WordNet in natural language processing systems. 1998.
18. Gers FA, Schraudolph NN, Schmidhuber J. Learning precise timing with LSTM recurrent networks. J Mach Learn Res. 2002;3:115–43.
22. Hermann KM, Kocisky T, Grefenstette E, Espeholt L, Kay W, Suleyman M, Blunsom P. Teaching machines to read and comprehend. In: Advances in neural information processing systems. 2015. p. 1693–1701.
23. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.
24. Jacovi A, Shalom OS, Goldberg Y. Understanding convolutional neural networks for text classification. 2018. arXiv preprint arXiv:1809.08037.
25. Kalchbrenner N, Grefenstette E, Blunsom P. A convolutional neural network for modelling sentences. 2014. arXiv preprint arXiv:1404.2188.
26. Kannan A, Kurach K, Ravi S, Kaufmann T, Tomkins A, Miklos B, Corrado G, Lukacs L, Ganea M, Young P, et al. Smart reply: automated response suggestion for email. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. p. 955–964.
28. Locke WN, Booth AD. Machine translation of languages. Am Document. 1956;7(2):135.
29. Manning CD, Schütze H. Foundations of statistical language processing. 1999.
30. Maybury M. Advances in automatic text summarization. Cambridge: MIT Press; 1999.
31. McCann B, Keskar NS, Xiong C, Socher R. The natural language decathlon: multitask learning as question answering. 2018. arXiv preprint arXiv:1806.08730.
32. McClelland JL, Rumelhart DE. Explorations in parallel distributed processing: a handbook of models, programs, and exercises. Cambridge: MIT Press; 1989.
33. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. 2013. p. 3111–3119.
34. Mimno D, Thompson L. The strange geometry of skip-gram with negative sampling. In: Empirical methods in natural language processing. 2017.
35. Plath W. Multiple path analysis and automatic translation. Amsterdam: North-Holland; 1967.
36. Radford A, Wu J, Amodei D, Amodei D, Clark J, Brundage M, Sutskever I. Better language models and their implications. OpenAI Blog. 2019. https://openai.com/blog/better-language-models.
39. Ruder S, Bingel J, Augenstein I, Søgaard A. Latent multi-task architecture learning. Proc AAAI Conf Artif Intell. 2019;33:4822–9.
40. Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. Tech. rep. California Univ San Diego La Jolla Inst for Cognitive Science; 1985.
41.
42. Søgaard A, Ruder S, Vulić I. On the limitations of unsupervised bilingual dictionary induction. 2018. arXiv preprint arXiv:1805.03620.
43. Sparck Jones K. Thesaurus. Encyclopedia of artificial intelligence. 1992;2:1605–13.
44. Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. 2014. p. 3104–3112.
46. Turing AM. Computing machinery and intelligence. In: Parsing the Turing test. Springer; 2009. p. 23–65.
47. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Advances in neural information processing systems. 2017. p. 5998–6008.
48. Vinyals O, Kaiser Ł, Koo T, Petrov S, Sutskever I, Hinton G. Grammar as a foreign language. In: Advances in neural information processing systems. 2015. p. 2773–2781.
49. Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. Matching networks for one shot learning. In: Advances in neural information processing systems. 2016. p. 3630–3638.
50. Wahlster W. Mobile speech-to-speech translation of spontaneous dialogs: an overview of the final Verbmobil system. In: Verbmobil: foundations of speech-to-speech translation. Springer; 2000. p. 3–21.
51. Wendlandt L, Kummerfeld JK, Mihalcea R. Factors influencing the surprising instability of word embeddings. 2018. arXiv preprint arXiv:1804.09692.
52. Winograd T. Understanding natural language. Cogn Psychol. 1972;3(1):1–191.
53. Woods WA. Semantics and quantification in natural language question answering. In: Advances in computers, vol. 17. Elsevier; 1978. p. 1–87.
55. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y. Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning. 2015. p. 2048–2057.
56. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: human language technologies. 2016. p. 1480–1489.
57. Young SJ, Chase LL. Speech recognition evaluation: a review of the US CSR and LVCSR programmes. Comput Speech Lang. 1998;12(4):263–79.
Metadata
Title: Transforming the generative pretrained transformer into augmented business text writer
Authors: Faisal Khalil, Gordon Pipa
Publication date: 01.12.2022
Publisher: Springer International Publishing
Published in: Journal of Big Data, Issue 1/2022
Electronic ISSN: 2196-1115
DOI: https://doi.org/10.1186/s40537-022-00663-7
