Complex & Intelligent Systems

Open Access 01-08-2022 | Case Study

Zero-shot domain paraphrase with unaligned pre-trained language models

Authors: Zheng Chen, Hu Yuan, Jiankun Ren

Published in: Complex & Intelligent Systems | Issue 1/2023

Abstract

Automatic paraphrase generation is an essential task of natural language processing. However, due to the scarcity of paraphrase corpus in many languages, Chinese, for example, generating high-quality paraphrases in these languages is still challenging. Especially in domain paraphrasing, it is even more difficult to obtain in-domain paraphrase sentence pairs. In this paper, we propose a novel approach for domain-specific paraphrase generation in a zero-shot fashion. Our approach is based on a sequence-to-sequence architecture. The encoder uses a pre-trained multilingual autoencoder model, and the decoder uses a pre-trained monolingual autoregressive model. Because these two models are pre-trained separately, they have different representations for the same token. Thus, we call them unaligned pre-trained language models. We train the sequence-to-sequence model with an English-to-Chinese machine translation corpus. Then, by inputting a Chinese sentence into this model, it could surprisingly generate fluent and diverse Chinese paraphrases. Since the unaligned pre-trained language models have inconsistent understandings of the Chinese language, we believe that the Chinese paraphrasing is actually performed in a Chinese-to-Chinese translation manner. In addition, we collect a small-scale English-to-Chinese machine translation corpus in the domain of computer science. By fine-tuning with this domain-specific corpus, our model shows an excellent capability of domain-paraphrasing. Experiment results show that our approach significantly outperforms previous baselines regarding Relevance, Fluency, and Diversity.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Text Paraphrase can be viewed as the task of expressing the same semantic content of the original text in a different way [1]. It requires the paraphrased text to be as diverse as possible while keeping the core semantics of the original text. As an essential task of Natural Language Processing, Text Paraphrase has a wide range of applications. For example, Semantic Parsing [2], Relation Extraction [3, 4], Question Answering [5, 6], and Dialog Generation [7] all benefit from Text Paraphrase. However, the research progress on Text Paraphrase is not satisfactory. Natural Language Processing has recently made tremendous progress in many other sub-areas, such as Machine Translation, Open-ended Generation, and Question Answering. These advances mostly rely on large-scale neural models, enormous computing power, and large-scale corpora. For Text Paraphrase, the challenge is that it is difficult to acquire a paraphrase corpus that is both high-quality and sufficiently large. Some researchers extract paraphrase corpora from large-scale online text in an unsupervised manner [8, 9]. Even setting aside how inefficient these approaches are, the retrieved paraphrase texts still suffer from severe problems such as semantic incoherence and lack of diversity. Moreover, since domain-specific paraphrase texts are even scarcer and harder to retrieve, domain-specific paraphrase generation remains rarely explored.

Round-trip translation, which requires no paraphrase corpora, is a ready-to-use paraphrase generation method. In this approach, a machine translation system translates an input source sentence in one language into a mediator sentence in a different language, and then translates this mediator sentence back into the source language. For example, to paraphrase “”, we first perform a Chinese-to-English translation, translating the source sentence to the mediator sentence “The weather is really nice today!”. Then, we translate the mediator sentence back into the source language via an English-to-Chinese translation, to obtain the paraphrase sentence, “”. With this pivoting approach, the scarcity problem of the paraphrase corpus is avoided. However, the quality of the generated paraphrase text is relatively low, because the performance of this approach depends mainly on the performance of the machine translation system, and machine translation is far from solved, especially in generating faithful, expressive, and elegant text. Moreover, each sentence is translated twice, so semantic information is lost in both translation steps, which hurts the consistency between the paraphrase and the original sentence.

Language Model-based zero-shot paraphrase generation is considered to be one of the most promising approaches. Guo et al. [10] proposed the first zero-shot paraphrase generation model. They trained a Transformer-based language model [11] using multi-lingual parallel corpora. In the paraphrase generation phase, given input in one language, the model could be guided to generate a paraphrase output in the same language. Compared to round-trip translation, this method allows the model to learn the representation of an input sentence and paraphrase it directly, thus minimizing the semantic loss during the paraphrasing process. This method, however, also has some severe drawbacks. First, it must be trained over large-scale multi-lingual parallel corpora. It also draws on the idea of the denoising auto-encoder (DAE) to augment the training data. Training with the DAE is extremely time-consuming, making it difficult for small research groups to reproduce this approach. In addition, a language identifier needs to be added to guide the model to generate the paraphrase in the same language as the input sentence. We question the effectiveness of the language identifier, since in our reproduction, controlling the output language, especially Chinese, proved far more problematic than just adding a language identifier.

Inspired by the research of Guo et al. [10], we propose Zeppel, a zero-shot paraphrase method based on unaligned pre-trained language models, which significantly reduces the difficulty and cost of the training process. We use the sequence-to-sequence model as the base architecture, with the multilingual BERT [12] as the encoder and the Chinese GPT2 [13] as the decoder. We train our model over an English–Chinese bilingual parallel corpus (the source language is English, and the target language is Chinese). In the inference phase, when a sentence is input, regardless of the language in which it is written, the model generates a Chinese sentence that expresses the same semantic content. Therefore, when given an English sentence, this model behaves like a state-of-the-art English-to-Chinese translation model. When given a sentence in another language, Japanese, for example, this model acts as a Japanese-to-Chinese translation model. When given a Chinese sentence, however, this model becomes a Chinese-to-Chinese paraphrase model. There is no need to provide a language identifier to guide the generation since the output sentences are always in Chinese. By leveraging pre-trained language models, our approach can generate paraphrased text with both Relevance and Fluency without training over a large corpus or employing a DAE to augment the given training data. We further conduct research on a domain-specific paraphrase task using a relatively small academic corpus in the domain of Computer Science. The generated high-quality paraphrase texts illustrate the data efficiency of our approach.

Our contributions in this paper are as follows: (1) we propose an effective method to build a paraphrase model using a pre-trained multilingual auto-encoder language model and a pre-trained monolingual auto-regressive language model, and train it with a bilingual corpus (“The Zeppel model”, “Model training” sections); (2) we propose a modified version of Diverse Beam Search for improving diversity not only between the output beam groups but also between the input and output sentences, thus more suitable for the task of paraphrase generation (“Paraphrase generation” section); (3) we construct an English–Chinese bilingual corpus in the domain of Computer Science (“Datasets” section), and then train a domain-specific paraphrase model on that corpus. The data, code, and model will be released upon acceptance of this paper.

Traditional paraphrase methods

Most traditional text paraphrasing is based on a thesaurus or predefined rules. A thesaurus-based paraphrase generation system generates paraphrases by replacing some words with their corresponding synonyms [14, 15]. This replacement could be carried out at both lexical level and phrase level. Early rule-based paraphrase generation methods generally rely on manually written paraphrase rules or patterns [16]. Later, some researchers have proposed to extract paraphrase rules automatically [17, 18] instead of manually writing them to support better diversity. Sentence splitting and combining [19] is also considered as a rule-based approach. These traditional paraphrase generation methods are often poor in terms of flexibility, stability, and textual quality.

Translation-based paraphrase methods

Due to its simplicity and convenience, the translation-based paraphrase methods are one of the most commonly used methods in a real-world setting. Round-trip translation proposed by Mallinson et al. [20] and back-translation proposed by Wieting et al. [21] both translate the input sentence into a different language and then back-translate the translation result into the original language as the paraphrase text. These two-step text paraphrase methods utilizing ready-to-use translation systems benefit significantly from the rapid development of machine translation technology. However, they also come with the problem of losing semantic information during the translation process. Other than directly building a paraphrase system on top of a translation system, machine translation also helps to build corpora for Paraphrase Identification [22] or Evaluation [23].

Seq2seq-based paraphrase methods

With the development of deep learning technology, neural models based on the sequence-to-sequence (seq2seq) architecture [24] have been applied to paraphrase tasks. Using a Stacked Residual LSTM Network [25], Prakash et al. [26] were the first to explore deep learning models for paraphrase generation. Inspired by the idea of CopyNet [27], Cao et al. [28] proposed a paraphrase generation model based on a copy mechanism. They also changed the backbone of the model to a Gated Recurrent Neural Network [29]. Gupta et al. [30] combined a Variational Autoencoder [31] with a seq2seq model to generate richer and more diverse paraphrase texts. Egonmwan et al. [32] integrated the Transformer model [11] into the seq2seq architecture to further improve the performance of paraphrase systems. In practice, by training over large-scale and high-quality paraphrase corpora, like PPDB [33], WikiAnswers [34], and MSCOCO [35], these seq2seq-based deep learning models can achieve satisfying results.

However, languages other than English do not have such paraphrase corpora. Hence, identifying paraphrase sentence pairs from large-scale online text becomes crucial when building a paraphrase system in a low-resource language [36‐38]. Nevertheless, even the authors themselves admit that an automatically extracted paraphrase corpus can never reach the quality of a human-written one. Since we mainly explore the zero-shot approach in this paper, we will not give further details about the research on paraphrase identification.

Zero-shot paraphrase methods

Zero-shot paraphrasing, which is closely tied to the Pre-train and Fine-tune Paradigm, is the latest and the most promising approach. Guo et al. [10] proposed the first zero-shot paraphrase generation model by pre-training a multilingual language model and then fine-tuning it with a parallel corpus. Thompson and Post [39] also found that a well-trained Multilingual Translation system could generate paraphrases in a zero-shot fashion. They [40] further improved the diversity of the generated paraphrases by discouraging the production of n-grams that are present in the input. Fan et al. [41] borrowed the idea of unsupervised machine translation and proposed a purely zero-shot approach, which does not even need a parallel corpus. However, the iterative back-translation procedure is even more time-consuming. Siddique et al.’s approach [42], which employs a reinforcement learning procedure, is also extremely time-consuming. These studies deeply inspire our work. However, we aim to avoid their extremely resource-intensive and time-consuming training processes and make it possible to apply zero-shot paraphrasing in a low-resource language or domain.

Methods

The Zeppel model

Zeppel is based on a seq2seq architecture. The encoder is initialized with the multilingual BERT and the decoder with the Chinese GPT2 (see Fig. 1). The multilingual BERT and the Chinese GPT2 are pre-trained separately and thus have different representations of the same Chinese token. Therefore, we call them unaligned pre-trained language models. Since the model has no indication that the input Chinese tokens and the output Chinese tokens belong to the same language, Chinese paraphrasing in our model is performed in a Chinese-to-Chinese translation manner. This is the key difference between our approach and previous zero-shot approaches.

In the training phase, Zeppel is trained to maximize the likelihood:

$$\begin{aligned} L_{(\theta )}=\sum _{t=1}^{n}\mathrm{log}P_{(\theta )}\left( y_t |y_1,\ldots ,y_{t-1};X;\theta \right) , \end{aligned}$$

(1)

where $Y=(y_1,\ldots ,y_n)$ is the target sequence, $y_t\in Y$ is the token generated by the model at timestep t, X is the input sentence, and $\theta $ denotes the parameters of the model.
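
As a rough illustration of Eq. (1), the sketch below sums the log-probabilities that a model assigns to the reference tokens of one target sentence. The function name and the toy probabilities are hypothetical; a real model would supply $P(y_t | y_1,\ldots ,y_{t-1};X)$ at each step.

```python
import math

def sequence_log_likelihood(step_probs):
    """Sum of log P(y_t | y_<t, X) over the target tokens, as in Eq. (1).

    step_probs[t] is the probability assigned to the reference token y_t
    at step t (hypothetical values, not model outputs).
    """
    return sum(math.log(p) for p in step_probs)

# Toy probabilities for a three-token target sentence.
probs = [0.5, 0.25, 0.8]
log_lik = sequence_log_likelihood(probs)

# Maximizing the likelihood is equivalent to minimizing the mean
# negative log-likelihood (the usual cross-entropy training loss).
loss = -log_lik / len(probs)
```

Working in log space turns the product of per-token probabilities into a sum, which is why Eq. (1) is written as a sum of logarithms.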

In the inference phase, given an input sentence X, Zeppel paraphrases it as follows:

The encoder vectorizes it with the vocabulary and embedding layer of multilingual BERT to get $\mathbf {E_x}$, then encodes the vector $\mathbf {E_x}$ to get the output of BERT $\mathbf {H_{mBERT}}$:

$$\begin{aligned} \mathbf {E_x}= & {} \mathrm{Embedding}_\mathrm{mBERT}(X), \end{aligned}$$

(2)

$$\begin{aligned} \mathbf {H_{mBERT}}= & {} \mathrm{mBERT}(\mathbf {E_x}). \end{aligned}$$

(3)

The input of the decoder is $\mathbf {H_{mBERT}}$ and the generated sequence $Y_{t-1}$ at the current time step. The decoder vectorizes $Y_{t-1}$ with the vocabulary and embedding layer of Chinese GPT2 to get $\mathbf {E_{Y_{t-1}}}$. Then through the decoder, the output token $y_{t}$ of the next time step is obtained:

$$\begin{aligned} \mathbf {E_{Y_{t-1}}}= & {} \mathrm{Embedding}_{\mathrm{GPT2}_\mathrm{zh}}(Y_{t-1}), \end{aligned}$$

(4)

$$\begin{aligned} y_{t}= & {} \mathrm{GPT2}_\mathrm{zh}(\mathbf {H_{mBERT}},\mathbf {E_{Y_{t-1}}}). \end{aligned}$$

(5)
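
The inference procedure in Eqs. (2)–(5) can be sketched as a plain autoregressive loop: the source is encoded once, and the decoder is called repeatedly on the encoder output and the tokens generated so far. This is a minimal sketch using greedy selection and toy stand-ins for the encoder and decoder; `generate`, `toy_encode`, and `toy_next_token` are hypothetical names, not part of the authors' implementation.

```python
from typing import Callable, List, Sequence

BOS, EOS = "[BOS]", "[EOS]"

def generate(
    encode: Callable[[Sequence[str]], object],
    next_token: Callable[[object, List[str]], str],
    source_tokens: Sequence[str],
    max_len: int = 32,
) -> List[str]:
    """Autoregressive decoding in the shape of Eqs. (2)-(5)."""
    memory = encode(source_tokens)           # plays the role of H_mBERT
    output = [BOS]                           # Y_{t-1}, grows by one token per step
    for _ in range(max_len):
        token = next_token(memory, output)   # y_t from the decoder
        if token == EOS:
            break
        output.append(token)
    return output[1:]                        # drop the [BOS] marker

# Toy stand-ins: the "encoder" keeps the tokens, the "decoder" emits
# them in reverse order and then stops.
def toy_encode(tokens):
    return list(tokens)

def toy_next_token(memory, prefix):
    step = len(prefix) - 1
    return memory[::-1][step] if step < len(memory) else EOS

result = generate(toy_encode, toy_next_token, ["a", "b", "c"])
```

The encoder output is computed only once per input sentence, while the decoder input grows by one token at every step.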

Model training

The training of Zeppel is simple and straightforward. We feed an English sentence into the model and supervise its output with the corresponding Chinese sentence. When a domain-specific paraphrase model is needed, we perform a two-step training. That is, we first use a general-domain English-to-Chinese parallel corpus to train the model, and then use a small-scale English-to-Chinese parallel corpus in a specific domain to fine-tune it. We discuss the effectiveness of this two-step training strategy in the “Discussion” section.

For each training sample, special tokens are added according to the pre-trained model. For an English sentence, which will be input into the multilingual BERT, we add [CLS] and [SEP] to its head and tail, respectively. For the corresponding Chinese sentence, which will be input into the Chinese GPT2, we add [BOS] and [EOS] as its beginning and ending tokens. It is worth mentioning that there is no need to add language identifiers to the English or Chinese sentence. The decoder, Chinese GPT2, can only generate sentences in Chinese, while the encoder, multilingual BERT, can understand as many as 104 languages. Hence, even if the language of a sentence fed to the encoder in the inference phase differs from the language used in the training phase, the multilingual BERT’s multi-language understanding ability yields similar representations for sentences with close semantics. From the English-to-Chinese training corpus, the Chinese GPT2 learns to generate a Chinese sentence based on these semantic representations. That is, the multilingual-BERT-to-Chinese-GPT2 model can perform machine translation from 104 languages to Chinese, including Chinese-to-Chinese machine translation. However, the multilingual BERT and the Chinese GPT2 have different vocabularies and different embedding matrices; therefore, they have an inconsistent understanding of the Chinese language. Thus, Zeppel performs Chinese paraphrasing in a $\mathrm {Chinese_{mBERT}\text{- }to\text{- }Chinese_{GPT2_{zh}}}$ translation manner.
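
The special-token convention described above can be sketched as follows. The helper `wrap_pair` and the example tokens are hypothetical, and real tokenizers would operate on subword IDs rather than on strings.

```python
def wrap_pair(english_tokens, chinese_tokens):
    """Add the model-specific special tokens to one training pair."""
    # Encoder side (multilingual BERT): [CLS] ... [SEP]
    encoder_input = ["[CLS]", *english_tokens, "[SEP]"]
    # Decoder side (Chinese GPT2): [BOS] ... [EOS]
    decoder_target = ["[BOS]", *chinese_tokens, "[EOS]"]
    return encoder_input, decoder_target

# Illustrative pair only; note that no language identifier is added.
enc, dec = wrap_pair(["machine", "translation"], ["机", "器", "翻", "译"])
```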

Paraphrase generation

In paraphrase generation, diversity is more important than in other constrained text generation tasks, such as Abstractive Summarization and Machine Translation. Nevertheless, traditional maximum-likelihood decoding algorithms such as Beam Search do not take diversity into consideration. Diverse Beam Search [43], a diversity-enhanced beam search algorithm, only considers the diversity among the output beam groups. In the paraphrase generation task, however, the diversity between the input and the output sentences is more critical. Hence, we modify Diverse Beam Search to take into account the diversity not only between the output beam groups but also between the input and output sentences. We use Hamming distance, n-gram repetition, and Levenshtein distance as diversity measures, and propose Diverse Beam Search for Paraphrase Generation (DBS-PG) as follows.

Let G be the number of beam groups and B be the beam size. When decoding the t-th token of the b-th beam, DBS-PG selects the word $w_{t}^{b}$ from the candidate words $\mathrm{Cand}_{t}^b$ (of size V) as follows:

$$\begin{aligned} w_{t}^{b}=\mathop {\arg \max }\limits _{{\mathrm{Cand}_{1, t}}^{b}, \ldots , {\mathrm{Cand}_{V, t}}^{b}} \sum _{v=1}^{V}\left[ \theta \left( \mathrm{Cand}_{v, t}^b \right) -H_{v, t}^{b}-N_{v, t}^{b}\right] ,\nonumber \\ \end{aligned}$$

(6)

where $\theta $ is the logarithm of the conditional probability of a candidate word given the previously generated words and the input. For decoding the t-th token, $\theta (w_{t})=\log P_{r}\left( w_{t} \mid w_{t-1}, \ldots , w_{1}, x \right) $. $H_{v, t}^{b}$ denotes the Hamming distance penalty for the candidate word. It is computed as follows:

$$\begin{aligned} H_{v, t}^{b}= {\left\{ \begin{array}{ll}0, &{} \mathrm{Cand}_{v, t}^{b} \notin S_{H} \\ \mathrm{HDP}, &{} \mathrm{Cand}_{v, t}^{b} \in S_{H}\end{array}\right. }, \end{aligned}$$

(7)

where $S_{H}$ is the set consisting of the t-th token of the input sentence X and the t-th tokens of the b-1 beams that have already been generated. HDP is the value of the Hamming distance penalty, which is 2.5 by default. $N_{v, t}^{b}$ denotes the N-gram repetition penalty. It is computed as follows:

$$\begin{aligned} N_{v, t}^{b}= {\left\{ \begin{array}{ll}0, &{} \left[ w_{t-n+1}^{b},\ldots ,w_{t-1}^{b},\mathrm{Cand}_{v, t}^{b}\right] \not \subset S_{N} \\ \mathrm{NRP}, &{} \left[ w_{t-n+1}^{b},\ldots ,w_{t-1}^{b},\mathrm{Cand}_{v, t}^{b}\right] \subset S_{N}\end{array}\right. }, \end{aligned}$$

(8)

where $S_{N}$ is the set of n-grams from the input sentence X and the b-1 beams that have already been generated. NRP is the value of the N-gram repetition penalty, which is 2.5 by default.
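
The penalties in Eqs. (6)–(8) can be illustrated with a small scoring function. This is a sketch under stated assumptions, not the authors' implementation: it reads Eq. (6) as choosing the candidate with the highest penalized score, uses bigrams (n = 2, a value the text does not specify), and works on toy English tokens. The names `pick_token` and `ngrams` are hypothetical.

```python
import math

HDP = 2.5  # Hamming distance penalty (default stated in the text)
NRP = 2.5  # N-gram repetition penalty (default stated in the text)

def ngrams(tokens, n):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pick_token(log_probs, prefix, source, earlier_beams, n=2):
    """Choose the next token for one beam.

    log_probs: dict mapping each candidate word to log P(word | prefix, x)
    prefix: tokens already decoded for this beam (its length is the position t)
    source: tokens of the input sentence X
    earlier_beams: token lists of the beams generated before this one
    """
    t = len(prefix)
    references = [source, *earlier_beams]
    # S_H: tokens at position t in the input and in the earlier beams.
    s_h = {ref[t] for ref in references if t < len(ref)}
    # S_N: n-grams of the input and of the earlier beams.
    s_n = set().union(*(ngrams(ref, n) for ref in references))

    def score(cand):
        h = HDP if cand in s_h else 0.0
        gram = tuple(prefix[len(prefix) - (n - 1):]) + (cand,)
        rep = NRP if len(prefix) >= n - 1 and gram in s_n else 0.0
        return log_probs[cand] - h - rep

    return max(log_probs, key=score)

source = ["the", "cat", "sat"]
log_probs = {"cat": math.log(0.6), "feline": math.log(0.3), "dog": math.log(0.1)}
choice = pick_token(log_probs, ["the"], source, earlier_beams=[])
```

In this toy case the most likely word, "cat", repeats both the input token at the same position and an input bigram, so it receives both penalties and the second most likely word is chosen instead.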

After each beam reaches its [EOS], we obtain a list of generated sentences $Y_\mathrm{list}=[Y_1,Y_2,\ldots ,Y_n]$. The sentences in $Y_\mathrm{list}$ are arranged in descending order of their joint probabilities. We then select the final paraphrase $Y_\mathrm{final}$ from the candidates in $Y_\mathrm{list}$ using a Levenshtein distance threshold (LThreshold), whose value is 0.25 by default. We compute the Levenshtein distance between each $Y_i\in Y_\mathrm{list}$ and X in turn, and let $D_i$ denote this distance divided by the length of X. The first $Y_i$ whose $D_i$ is greater than LThreshold is chosen as the final paraphrase $Y_\mathrm{final}$. If no candidate exceeds the threshold, we choose the last sentence $Y_n$ as the final paraphrase $Y_\mathrm{final}$.
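
The selection step can be sketched as below, assuming that candidates are scanned in order and the first one exceeding the threshold is returned, with the last candidate as the fallback. The function names are hypothetical, and the distance is computed over characters here; a token-level distance would work the same way.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (item_a != item_b),   # substitution
            ))
        prev = cur
    return prev[-1]

def select_paraphrase(source, candidates, threshold=0.25):
    """Return the first candidate whose normalized distance D_i exceeds the threshold."""
    for cand in candidates:   # candidates are sorted by descending probability
        if levenshtein(cand, source) / len(source) > threshold:
            return cand
    return candidates[-1]     # fallback: the last sentence Y_n

final = select_paraphrase("abcd", ["abcd", "abce", "wxyz"])
```

Here "abcd" has distance 0 and "abce" has a normalized distance of exactly 0.25, which does not exceed the threshold, so "wxyz" is selected.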

Experiment and results

Datasets

We employ the translation2019zh from the Large Scale Chinese Corpus for NLP [44] for general-domain training. The translation2019zh contains about 3.8 million high-quality English–Chinese parallel sentence pairs. As for domain-specific training, we crawled a small corpus in the computer science domain from Journal of Software¹ and Computer Science². We crawled the Chinese and English abstracts of 6931 and 19782 papers from their websites, respectively. By performing sentence segmentation, aligning the sentences by applying the vecalign³, and removing poorly aligned sentence pairs, we obtained 59562 English–Chinese parallel sentence pairs in the computer science domain. For automatic evaluation, we randomly split this domain-specific corpus into an evaluation set of 3000 sentence pairs, and the rest are used as the training set. Moreover, we also randomly selected 200 sentence pairs from the evaluation set for human evaluation. Table 1 shows the statistics of our dataset.

Table 1

Statistics of training and evaluation dataset

	Training	Evaluation$_\mathrm{Automatic}$	Evaluation$_\mathrm{Human}$
General domain	3.8M	–	–
Computer Science domain	56k	3k	200
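
The split sizes reported in Table 1 follow directly from the corpus size: 59562 − 3000 = 56562 training pairs (about 56k). A minimal sketch of such a random split is shown below; the seed, the function name, and the placeholder sentence pairs are assumptions for illustration only.

```python
import random

def split_corpus(pairs, n_eval=3000, n_human=200, seed=0):
    """Randomly split parallel pairs into training, evaluation, and human-evaluation sets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    eval_set, train_set = shuffled[:n_eval], shuffled[n_eval:]
    # The human-evaluation subset is drawn from the evaluation set.
    human_set = rng.sample(eval_set, n_human)
    return train_set, eval_set, human_set

# Placeholder pairs standing in for the 59562 aligned sentence pairs.
pairs = [(f"en-{i}", f"zh-{i}") for i in range(59562)]
train, evaluation, human = split_corpus(pairs)
```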

Evaluation metrics

Automatic evaluation

We compute the cosine similarity between the semantic representations of the paraphrases and the input sentences to evaluate the semantic consistency of the generated paraphrases. The semantic representation of a sentence is computed using text2vec⁴, with bert-base-chinese⁵ serving as the word-vector model. We also employ Distinct-2⁶ and Inverse Self-BLEU (defined as $1-$Self-BLEU) [45] as metrics to evaluate the diversity of generated paraphrases. Inverse Self-BLEU is calculated between an original sentence and its paraphrase to evaluate their dissimilarity, while Distinct-2 is calculated on the generated sentence only, to evaluate whether a sentence is expressed in diverse ways.
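
For reference, the two simpler metrics can be sketched in a few lines. The sketch assumes the common definition of Distinct-2 as the ratio of unique bigrams to all bigrams in a sentence, and it takes sentence vectors as plain lists of numbers; in the actual evaluation these vectors come from text2vec. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def distinct_n(tokens, n=2):
    """Unique n-grams divided by total n-grams (assumed definition of Distinct-n)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
d2 = distinct_n(list("abab"))  # bigrams: ab, ba, ab
```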

Human evaluation

We recruited six human annotators to rate the generated paraphrases with three metrics: Relevance, Fluency, and Diversity, respectively. Scores range from 0 to 5, the higher the better. Each paraphrase sample is rated by at least two annotators. During evaluation, the annotator is unaware of which model the paraphrase sample came from.

Implementation and baselines

We implement our model in PyTorch, using the Transformers⁷ library provided by HuggingFace. The multilingual BERT we use as the encoder has 12 hidden layers, 12 attention heads, and a hidden state dimension of 768. The Chinese GPT2 we use as the decoder has the same configuration: 12 hidden layers, 12 attention heads, and a hidden state dimension of 768. The vocabulary sizes of the multilingual BERT and the Chinese GPT2 differ: they are 119547 and 21128, respectively. Moreover, we follow the pseudo self-attention approach introduced by Ziegler et al. [46], thus minimizing the extra parameters in the sequence-to-sequence architecture. For training, we set the learning rate to $5\times 10^{-5}$ and the training batch size to 32. It takes less than 24 hours to train our model on a single Nvidia RTX 3090 GPU. For paraphrase generation, we set the number of groups of Diverse Beam Search to 5, and each group’s beam size to 2. The Hamming distance penalty and the n-gram repetition penalty are both 2.5.

To verify the effectiveness of our method, we employ round-trip translation and trans-paraphrasing as baselines. To implement round-trip translation, we introduce English as the pivot language, and train two translation models, Chinese-to-English and English-to-Chinese, separately. These translation models also use BERT as the encoder, and GPT2 as the decoder. We use the same training dataset to train these models in the same two-step fashion. To perform paraphrase generation, a source sentence is first translated into English, and then translated back into Chinese.

To implement trans-paraphrasing, we first translate the English sentences of the aligned bilingual corpus into Chinese, and then train a sequence-to-sequence paraphrase model using this $\mathrm {Chinese\text{- }to\text{- }Chinese_{translated}}$ pseudo-paraphrase corpus. Since the corpus is obtained via translation, the generated paraphrase texts retain noticeable translationese characteristics. Hence, this approach is called trans-paraphrasing.

We also provide the results of Guo et al.’s approach [10] for comparison. We call their model unpretrained-paraphrasing, since the key difference between their model and ours is that we utilize pre-trained language models while theirs does not. As suggested, we train the unpretrained-paraphrasing model from scratch using MultiUN⁸ and OpenSubtitles⁹. Then, for a fair comparison, we fine-tune it using the same corpus that we use for our model. The training process is augmented with the DAE, while the fine-tuning process is not. Guo et al.’s original implementation employed a Top-K sampling algorithm [47] for decoding. Stochastic decoding algorithms provide better diversity while potentially compromising the quality of the generated paraphrase. Thus, we also report the results of decoding with the DBS-PG we propose in this paper. In addition, during decoding, we filter out the tokens that never appear in the Chinese corpus to ensure that the generated paraphrase texts are in Chinese. Generating non-Chinese characters can have a remarkably negative impact on the quality of the paraphrasing, and without this filtering such a problem happens occasionally, especially with Top-K sampling decoding.

Results and analysis

Table 2

Automatic evaluation results of different paraphrase approaches

	Cosine similarity	Distinct-2	Inverse self-BLEU
Round-trip translation	0.781	0.824	0.576
Trans-paraphrasing	0.782	0.793	0.630
Unpretrained-paraphrasing [10]	0.799	0.807	0.589
Unpretrained-paraphrasing [10] with DBS-PG	0.810	0.792	0.551
Zeppel (ours)	0.823	0.794	0.535

Bold font indicates the best performance for each metric

Table 2 shows the results of the automatic evaluation. Zeppel achieves 0.823 in cosine similarity, the highest score among all approaches. This result indicates that the paraphrase texts generated by Zeppel have better semantic consistency than those generated by the baselines. Meanwhile, Zeppel also achieves competitive results in Distinct-2 and Inverse Self-BLEU. Note that although higher is also better for Distinct-2 and Inverse Self-BLEU, these metrics need to be judged at the same similarity level. Without a high similarity score, high diversity scores are meaningless, since a randomly generated sentence could achieve the highest diversity score. Generally speaking, a Distinct-2 higher than 0.75 and an Inverse Self-BLEU higher than 0.5 can be considered good enough.

Unpretrained-paraphrasing with DBS-PG achieves a cosine similarity of 0.810, the second-highest among these approaches and 0.011 higher than that of the original Unpretrained-paraphrasing (0.799), which was decoded using Top-K Sampling. However, the original Unpretrained-paraphrasing shows better results in Distinct-2 and Inverse Self-BLEU. Such results corroborate our assumption that DBS-PG produces paraphrases with better semantic similarity, while Top-K Sampling provides better diversity.

Comparing Zeppel with Unpretrained-paraphrasing with DBS-PG, Zeppel performs better on cosine similarity, which is the most critical metric, and slightly worse on Inverse Self-BLEU. The difference on Distinct-2 is marginal. We argue that the BERT and GPT2 within Zeppel have been pre-trained over a large corpus and thus have better capabilities in transferring their knowledge into the domain of Computer Science. The Unpretrained-paraphrasing model has only been trained on MultiUN and OpenSubtitles. Although these two corpora are relatively large, the knowledge within them is still limited and cannot be compared to the large corpora on which BERT and GPT2 were trained. Therefore, the Unpretrained-paraphrasing model struggles to produce paraphrases in the domain of Computer Science.

For further in-depth investigation of the paraphrase results, we randomly select 1000 samples, 200 by each model, from the automatic evaluation results and show them in Fig. 2. From this figure, we notice that the sample distributions of baselines are more scattered than those of Zeppel. This result indicates that the paraphrase text generated by Zeppel has better consistency in performance. In addition, the baseline models might have achieved remarkable consistency and diversity scores on average. However, a high consistency score does not always come with a high diversity score for a particular sample. A good paraphrased text requires consistency and diversity at the same time. For Zeppel, the coherence of these two indicators is much better.

Table 3

Human evaluation results of different paraphrase approaches

	Relevance	Fluency	Diversity
Round-trip translation	3.68	3.84	3.36
Trans-paraphrasing	3.28	3.50	3.22
Unpretrained-paraphrasing [10]	3.72	3.75	4.25
Unpretrained-paraphrasing [10] with DBS-PG	3.83	3.87	3.81
Zeppel (ours)	3.97	3.96	4.08

Bold font indicates the best performance for each metric

The results of the human evaluation are shown in Table 3. Zeppel achieves the highest Relevance and Fluency scores among all approaches. The original Unpretrained-paraphrasing achieves the best Diversity score due to the stochastic decoding algorithm it utilizes. However, as expected, this diversity comes at the cost of quality: Unpretrained-paraphrasing achieves the second-lowest Fluency score, while its Relevance score is only slightly higher than that of Round-trip translation. The DBS-PG decoding algorithm improves its paraphrase quality, resulting in higher Relevance and Fluency scores, yet lowers its Diversity.

For the Diversity results, there is a surprisingly large gap between the manual and automatic evaluations. The two translation-based baselines achieve outstanding diversity scores in the automatic evaluation, yet their diversity scores in the manual evaluation are surprisingly low. We suspect this is because they suffer severe information loss during generation, which shortens their output sentences. Human judges may tend to give lower diversity scores when evaluating shorter sentences. Since beam search is one of the causes of the length problem [48], the Unpretrained-paraphrasing model decoded using Top-K Sampling avoids this problem and still achieves a remarkable performance in the manual evaluation. In the appendix, we give some comparative examples so that readers can perform a subjective evaluation on their own.

Discussion

Comparison of different training strategies

We conduct an experiment to verify the effectiveness of the two-step training strategy. For comparison, we train two paraphrase models that use only the general-domain training corpus or only the computer science training corpus, respectively. The result is shown in Table 4. The two-step training strategy achieves the highest scores in cosine similarity and Distinct-2. Meanwhile, $\mathrm {Zeppel_{domain}}$, which is trained over a relatively small corpus, performs the worst on both of these metrics. This means that the quality of domain paraphrasing depends heavily on both the overall scale of the training corpus and the availability of a domain-specific corpus.

Table 4

Automatic evaluation results of different training strategies

	Cosine similarity	Distinct-2	Inverse self-BLEU
$\mathrm {Zeppel_{general}}$	0.787	0.767	0.673
$\mathrm {Zeppel_{domain}}$	0.728	0.655	0.899
$\mathrm {Zeppel_{twostep}}$	0.823	0.794	0.535

Bold font indicates the best performance for each metric
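
As an illustration of the Distinct-2 column in Table 4, the following is a minimal sketch that assumes the common definition of Distinct-2 as the ratio of unique bigrams to total bigrams over the generated sentences. The function names and the pre-tokenized input format are assumptions made for this example; the tokenization actually used in the experiments is not shown in this section.

```python
from typing import Iterable, List, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return the consecutive token pairs of one sentence."""
    return list(zip(tokens, tokens[1:]))


def distinct_2(sentences: Iterable[List[str]]) -> float:
    """Ratio of unique bigrams to total bigrams over a set of sentences.

    Assumed definition for illustration; returns 0.0 when no bigram exists.
    """
    all_bigrams = [bg for sent in sentences for bg in bigrams(sent)]
    if not all_bigrams:
        return 0.0
    return len(set(all_bigrams)) / len(all_bigrams)


if __name__ == "__main__":
    # A repetitive sentence scores lower than a varied one.
    print(distinct_2([["a", "b", "a", "b"]]))
    print(distinct_2([["a", "b", "c", "d"]]))
```

A higher value means that fewer bigrams are repeated, which is why the metric is read as a measure of how informative the generated text is on its own.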

Comparison with aligned pre-trained models

For comparison with Zeppel, we train a paraphrase model in the computer science domain whose encoder and decoder are both multilingual BERT (aligned models). To eliminate the influence of other factors, we do not adopt a special generation method for either model in this experiment; the paraphrases are generated by greedy search.

The results are shown in Table 5. The Inverse Self-BLEU obtained by the aligned model is far lower than that of Zeppel, while the two models differ little in Cosine similarity and Distinct-2. This indicates that, because the encoder and decoder of the aligned model are the same pre-trained model, they have a consistent understanding of the Chinese language. Thus, the paraphrases generated by the aligned model are highly similar to the input sentence not only in semantics but also in content.

Table 5

Automatic evaluation results of aligned model and unaligned model

	Cosine similarity	Distinct-2	Inverse self-BLEU
BERT2BERT(aligned)	0.896	0.805	0.176
Zeppel(unaligned)	0.891	0.808	0.261

Bold font indicates the best performance for each metric
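
To make the reading of the Inverse Self-BLEU column more concrete, the sketch below computes a simplified stand-in: one minus the clipped n-gram precision of the paraphrase against the input sentence. This is not the BLEU implementation used in the experiments; it only assumes, as the discussion suggests, that a low score means many generated n-grams repeat the input. The names `ngram_counts` and `inverse_overlap` are hypothetical.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def inverse_overlap(source: List[str], paraphrase: List[str], n: int = 2) -> float:
    """One minus the clipped n-gram precision of `paraphrase` against `source`.

    Simplified proxy for an inverse BLEU-style score: 0.0 means every n-gram
    of the paraphrase is copied from the source, 1.0 means none is.
    """
    out = ngram_counts(paraphrase, n)
    total = sum(out.values())
    if total == 0:
        return 0.0
    src = ngram_counts(source, n)
    matched = sum(min(count, src[gram]) for gram, count in out.items())
    return 1.0 - matched / total


if __name__ == "__main__":
    # Character-level tokens, as is common for Chinese text.
    print(inverse_overlap(list("abcd"), list("abcd")))
    print(inverse_overlap(list("abcd"), list("abxy")))
```

Under this proxy, a model that largely copies its input, as the aligned model tends to do, would score close to zero.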

Comparison with different decoding algorithms

We also run an experiment to verify the effectiveness of Diverse Beam Search for Paraphrase Generation (DBS-PG), the decoding algorithm we propose to improve the diversity of paraphrased text. We use Greedy Search, Beam Search, and Diverse Beam Search as baseline decoding algorithms for comparison. The results are shown in Table 6.

As can be seen from the table, Greedy Search and Beam Search achieve the best Cosine similarity because of their maximum-likelihood decoding objective, meaning that the paraphrased texts generated by these algorithms are the most semantically similar to the input text. However, their performance on Inverse Self-BLEU is not satisfactory. Such a low Inverse Self-BLEU score indicates that a significant portion of the generated n-grams repeats the input text. The Diverse Beam Search decoding algorithm markedly improves the Inverse Self-BLEU score of the generated paraphrases, which is why most existing text paraphrasing systems adopt it as the default decoding algorithm. However, the improvement in diversity harms the semantic accuracy of the generated text: the Cosine similarity score drops by nearly 10 percent relative to Greedy Search (from 0.891 to 0.803).

By utilizing the Hamming distance penalty, the N-gram repetition penalty, and the Levenshtein distance threshold between the input text and the output beam groups, our DBS-PG achieves the best diversity performance (Inverse Self-BLEU) while still gaining a remarkable improvement in semantic accuracy (Cosine similarity) over Diverse Beam Search. To better understand how these penalties and the threshold contribute to the overall performance, we further perform three ablation studies, each running DBS-PG without one of the three components. The results are also shown in Table 6.

We see that the Levenshtein distance threshold contributes a lot to the Inverse Self-BLEU. DBS-PG without the Levenshtein distance threshold scores 0.136 lower than the full DBS-PG, since the threshold forces the generated texts to differ from the input text. However, removing it yields a 0.020 increase in Cosine similarity, which means that introducing the Levenshtein distance threshold slightly hurts the semantic representation of the produced paraphrases.

Both the Hamming distance penalty and the N-gram repetition penalty improve the Cosine similarity. We argue that these penalties expand the search space of the beam search, which leads to finding paraphrases that are semantically closer to the input. The Hamming distance penalty and the N-gram repetition penalty also contribute to the Inverse Self-BLEU, since we take the input text into consideration rather than only considering the penalties between generated groups as in the vanilla DBS.

The N-gram repetition penalty also slightly improves the Distinct-2. However, owing to the excellent text generation capability of the Chinese GPT2, the paraphrased texts generated by all decoding algorithms achieve high Distinct-2 scores, and the differences among them are minor. High Distinct-2 scores indicate that the generated texts are of high quality and informative from a stand-alone perspective.

Table 6

Automatic evaluation results of different decoding algorithms

	Cosine similarity	Distinct-2	Inverse Self-BLEU
Greedy search	0.891	0.808	0.261
Beam search	0.881	0.805	0.250
Diverse beam search	0.803	0.799	0.348
DBS-PG (ours)	0.823	0.794	0.535
DBS-PG without Levenshtein	0.843	0.795	0.399
DBS-PG without Hamming	0.811	0.795	0.501
DBS-PG without N-gram	0.820	0.799	0.515

Bold font indicates the best performance for each metric
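
The three components ablated in Table 6 can be sketched roughly as follows. This is an illustrative approximation only: the exact scoring formulas of DBS-PG are defined in the Methods section and are not reproduced here, so the function names, the `min_distance` parameter, and the way the penalty is counted are all assumptions. The sketch shows a standard Levenshtein distance, a threshold check that rejects candidates too close to the input, and a Hamming-style count of how often a token already occupies the same position in earlier sequences (which, following the text, may include the input sentence itself).

```python
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


def keep_candidate(source: Sequence[str], candidate: Sequence[str],
                   min_distance: int) -> bool:
    """Hypothetical threshold check: keep only candidates far enough from the input."""
    return levenshtein(source, candidate) >= min_distance


def hamming_penalty(token: str, step: int, earlier: List[Sequence[str]]) -> int:
    """Count earlier sequences that already use `token` at position `step`."""
    return sum(1 for seq in earlier if step < len(seq) and seq[step] == token)


if __name__ == "__main__":
    source = "abcd"
    print(keep_candidate(source, "abcd", min_distance=1))  # identical copy is rejected
    print(hamming_penalty("b", 1, [source, "xbz"]))        # both use "b" at position 1
```

In a decoder, the penalty count would typically be scaled and subtracted from a token's log-probability, while the threshold check would filter finished hypotheses; how DBS-PG weights these terms is not specified in this section.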

Comparison of training costs of different models

In Table 7, we list the model size, the training corpus, and the training cost of our Zeppel model. For comparison, we also list two other deep neural network-based zero-shot paraphrase models, those of Thompson and Post [40] and Guo et al. [10]. Thompson and Post employ a vanilla sequence-to-sequence transformer model with 745M parameters. They trained their model with 99.8 million sentences in 39 languages on a server with four Nvidia RTX 2080ti GPUs for about six weeks. According to their paper, the cost of a single training procedure is close to $13,000.

Guo et al.'s model is an auto-regressive transformer model, that is, a GPT-like model, with 110 million parameters. They trained their model with 125.9 million sentences, about 25% more than Thompson and Post. However, the training cost should be much lower since their model is relatively small. They did not disclose the hardware platform or the time consumption of their training procedure. A reasonable guess is that such a model may also take several days to converge on a multi-GPU server. We re-implemented their approach, which took us eight days of training on a server with eight 3090 GPUs. The final loss we obtained was 0.821. Further training could possibly enhance its performance, at additional expense.

Since our Zeppel model is based on the multilingual BERT and Chinese GPT2 models, both well pre-trained, it takes us less than 24 hours to fine-tune Zeppel on a single-GPU workstation. Moreover, the knowledge contained in the pre-training corpora of BERT and GPT2 is much larger than that in the corpora of Thompson and Post and of Guo et al. As a result, the Zeppel model can achieve better language understanding and generation capability than models trained from scratch, with significantly less training effort.

Table 7

Training costs of different models

	Architecture	Parameters	Corpus	Time
Thompson and Post [40]	Seq2seq	745M	99.8M	6 weeks on 4 $\times$ 2080
Guo et al. [10]	GPT-like	110M	125.9M	8 days on 8 $\times$ 3090
Zeppel (ours)	Seq2seq	212M	3.8M	24 h on 1 $\times$ 3090

Multilingual BERT (110M parameters) was trained for 4 days on four Cloud TPUs in Pod configuration using Wikipedia texts in 104 languages.

Chinese GPT2 (102M parameters) was trained for 49 days on four Nvidia RTX 2080ti GPUs using 14G Chinese texts.
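
For a rough sense of scale, the training times in Table 7 can be converted into GPU-hours. The sketch below is only back-of-the-envelope arithmetic under loud assumptions: it treats a 2080-class and a 3090 GPU-hour as comparable (they are not equally fast), uses 24 hours as an upper bound for Zeppel, and excludes the pre-training of multilingual BERT and Chinese GPT2 described in the notes above. The helper `gpu_hours` is a hypothetical name introduced for this example.

```python
def gpu_hours(days: float, gpus: int) -> float:
    """Wall-clock days on `gpus` devices expressed as GPU-hours."""
    return days * 24 * gpus


# Figures taken from Table 7; the Zeppel entry is an upper bound (fine-tuning only).
costs = {
    "Thompson and Post": gpu_hours(6 * 7, 4),
    "Guo et al. (re-implementation)": gpu_hours(8, 8),
    "Zeppel (fine-tuning only)": gpu_hours(1, 1),
}

# Zeppel's reported size matches the sum of its two pre-trained components.
zeppel_params_m = 110 + 102  # multilingual BERT + Chinese GPT2, in millions

if __name__ == "__main__":
    for name, hours in costs.items():
        print(f"{name}: {hours:.0f} GPU-hours")
    print(f"Zeppel parameters: {zeppel_params_m}M")
```

Even with these caveats, the gap between fine-tuning pre-trained components and training from scratch is about two orders of magnitude in GPU-hours.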

Conclusion

In this work, we propose a novel zero-shot domain paraphrase approach named Zeppel. We train it with an English-to-Chinese aligned bilingual corpus. Then, given a Chinese sentence as input, the model can, surprisingly, generate fluent and diverse Chinese paraphrases. Experimental results show that our approach significantly outperforms the baselines in Relevance, Fluency, and Diversity. In the future, we would like to explore its applicability in an online academic paraphrase system such as Langsmith [49]. Moreover, since it is much easier to acquire a machine translation corpus than a paraphrase corpus, other low-resource languages with a decent pre-trained autoregressive language model, such as Japanese with japanese-gpt2¹⁰ and Korean with KoGPT2¹¹, may also benefit from our zero-shot paraphrasing approach.

A.1 Sample 1

A.2 Sample 2

A.3 Sample 3

A.4 Sample 4

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


1.

Madnani N, Dorr BJ (2010) Generating phrasal and sentential paraphrases: a survey of data-driven methods. Comput Linguist 36(3):341–387. https://doi.org/10.1162/coli_a_00002

2.

Su Y, Yan X (2017) Cross-domain semantic parsing via paraphrasing. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 1235–1246. Association for Computational Linguistics, Copenhagen, Denmark. https://doi.org/10.18653/v1/D17-1127. https://www.aclweb.org/anthology/D17-1127

3.

Romano L, Kouylekov M, Szpektor I, Dagan I, Lavelli A (2006) Investigating a generic paraphrase-based approach for relation extraction. In: 11th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Trento, Italy. https://www.aclweb.org/anthology/E06-1052

4.

Yu J, Zhu T, Chen W, Zhang W, Zhang M (2020) Improving relation extraction with relational paraphrase sentences. In: Proceedings of the 28th international conference on computational linguistics, pp 1687–1698. International Committee on Computational Linguistics, Barcelona, Spain (Online). https://doi.org/10.18653/v1/2020.coling-main.148. https://www.aclweb.org/anthology/2020.coling-main.148

5.

Gan WC, Ng HT (2019) Improving the robustness of question answering systems to question paraphrasing. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 6065–6075. Association for Computational Linguistics, Florence, Italy. https://doi.org/10.18653/v1/P19-1610. https://www.aclweb.org/anthology/P19-1610

6.

Dong L, Mallinson J, Reddy S, Lapata M (2017) Learning to paraphrase for question answering. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp 875–886. Association for Computational Linguistics, Copenhagen, Denmark. https://doi.org/10.18653/v1/D17-1091. https://www.aclweb.org/anthology/D17-1091

7.

Gao S, Zhang Y, Ou Z, Yu Z (2020) Paraphrase augmented task-oriented dialog generation. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 639–649. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.acl-main.60. https://www.aclweb.org/anthology/2020.acl-main.60

8.

Dolan B, Quirk C, Brockett C (2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: COLING 2004: Proceedings of the 20th international conference on computational linguistics, pp 350–356. COLING, Geneva, Switzerland. https://www.aclweb.org/anthology/C04-1051

9.

Pronoza E, Yagunova E, Pronoza A (2016) Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. Commun Comput Inf Sci 573:146–157

10.

Guo Y, Liao Y, Jiang X, Zhang Q, Zhang Y, Liu Q (2019) Zero-shot paraphrase generation with multilingual language models. arXiv:1911.03597

11.

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R, Vishwanathan SVN, Garnett R (eds) Advances in neural information processing systems 30: annual conference on neural information processing systems 2017, Long Beach, CA, USA, pp 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

12.

Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-1423. https://www.aclweb.org/anthology/N19-1423

13.

Du Z (2019) GPT2-Chinese: tools for training GPT2 model in Chinese language. https://github.com/Morizeyao/GPT2-Chinese

14.

Bolshakov IA, Gelbukh A (2004) Synonymous paraphrasing using wordnet and internet. In: International conference on application of natural language to information systems, pp 312–323. Springer

15.

Kauchak D, Barzilay R (2006) Paraphrasing for automatic evaluation. In: Proceedings of the human language technology conference of the NAACL, Main Conference, pp. 455–462. Association for Computational Linguistics, New York City, USA. https://www.aclweb.org/anthology/N06-1058

16.

Zhang Y, Yamamoto K (2002) Paraphrasing of Chinese utterances. In: COLING 2002: the 19th international conference on computational linguistics. https://www.aclweb.org/anthology/C02-1056

17.

Zong C, Zhang Y, Yamamoto K, Sakamoto M, Shirai S (2001) Approach to spoken chinese paraphrasing based on feature extraction. In: 6th natural language processing pacific rim symposium, pp 551–556

18.

Lin D, Pantel P (2001) Discovery of inference rules for question-answering. Nat Lang Eng 7(4):343–360

19.

Kouda S, Fujita A, Inui K (2001) Issues in sentence-dividing paraphrasing: a empirical study. In: Proceedings of the annual conference of JSAI 15th Annual Conference, pp 6–6. The Japanese Society for Artificial Intelligence

20.

Mallinson J, Sennrich R, Lapata M (2017) Paraphrasing revisited with neural machine translation. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp 881–893. Association for Computational Linguistics, Valencia, Spain. https://www.aclweb.org/anthology/E17-1083

21.

Wieting J, Mallinson J, Gimpel K (2017) Learning paraphrastic sentence embeddings from back-translated bitext. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 274–285. Association for Computational Linguistics, Copenhagen, Denmark. https://doi.org/10.18653/v1/D17-1026. https://www.aclweb.org/anthology/D17-1026

22.

Suzuki Y, Kajiwara T, Komachi M (2017) Building a non-trivial paraphrase corpus using multiple machine translation systems. In: Proceedings of ACL 2017, Student Research Workshop, pp 36–42. Association for Computational Linguistics, Vancouver, Canada. https://www.aclweb.org/anthology/P17-3007

23.

Chen D, Dolan W (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the association for computational linguistics: human language technologies, pp 190–200. Association for Computational Linguistics, Portland, Oregon, USA. https://www.aclweb.org/anthology/P11-1020

24.

Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27: annual conference on neural information processing systems, Montreal, Quebec, Canada, pp 3104–3112. https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html

25.

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

26.

Prakash A, Hasan SA, Lee K, Datla V, Qadir A, Liu J, Farri O (2016) Neural paraphrase generation with stacked residual LSTM networks. In: Proceedings of COLING 2016, the 26th international conference on computational linguistics: technical papers, pp 2923–2934. The COLING 2016 Organizing Committee, Osaka, Japan. https://www.aclweb.org/anthology/C16-1275

27.

Gu J, Lu Z, Li H, Li VOK (2016) Incorporating copying mechanism in sequence-to-sequence learning. In: Proceedings of the 54th annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp 1631–1640. Association for Computational Linguistics, Berlin, Germany. https://doi.org/10.18653/v1/P16-1154. https://www.aclweb.org/anthology/P16-1154

28.

Cao Z, Luo C, Li W, Li S (2017) Joint copying and restricted generation for paraphrase. In: Singh SP, Markovitch S (eds) Proceedings of the Thirty-First AAAI conference on artificial intelligence, February 4-9, 2017, San Francisco, California, USA, pp 3152–3158. AAAI Press, ???. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14527

29.

Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1724–1734. Association for Computational Linguistics, Doha, Qatar. https://doi.org/10.3115/v1/D14-1179. https://www.aclweb.org/anthology/D14-1179

30.

Gupta A, Agarwal A, Singh P, Rai P (2018) A deep generative framework for paraphrase generation. In: McIlraith SA, Weinberger KQ (eds) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative applications of artificial intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp 5149–5156. AAAI Press, ??? (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16353

31.

Kingma DP, Welling M (2014) Auto-encoding variational bayes. In: Bengio Y, LeCun Y (eds) 2nd International conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. arXiv.org/abs/1312.6114

32.

Egonmwan E, Chali Y (2019) Transformer and seq2seq model for paraphrase generation. In: Proceedings of the 3rd Workshop on Neural Generation and Translation, pp 249–255. Association for Computational Linguistics, Hong Kong. https://doi.org/10.18653/v1/D19-5627. https://www.aclweb.org/anthology/D19-5627

33.

Pavlick E, Rastogi P, Ganitkevitch J, Van Durme B, Callison-Burch C (2015) PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (Volume 2: Short Papers), pp 425–430. Association for Computational Linguistics, Beijing, China (2015). https://doi.org/10.3115/v1/P15-2070. https://www.aclweb.org/anthology/P15-2070

34.

Fader A, Zettlemoyer L, Etzioni O (2013) Paraphrase-driven learning for open question answering. In: Proceedings of the 51st annual meeting of the association for computational linguistics (Volume 1: Long Papers), pp 1608–1618. Association for Computational Linguistics, Sofia, Bulgaria. https://www.aclweb.org/anthology/P13-1158

35.

Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: European conference on computer vision, pp 740–755. Springer

36.

Mahmoud A, Zrigui A, Zrigui M (2017) A text semantic similarity approach for arabic paraphrase detection. In: International conference on computational linguistics and intelligent text processing, pp 338–349. Springer

37.

Srivastava S, Govilkar S (2017) A survey on paraphrase detection techniques for Indian regional languages. Int J Comput Appl 163(9):0975–8887

38.

Malajyan A, Avetisyan K, Ghukasyan T (2020) Arpa: Armenian paraphrase detection corpus and models. In: 2020 Ivannikov Memorial Workshop (IVMEM), pp 35–39, IEEE

39.

Thompson B, Post M (2020) Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In: Proceedings of the 2020 Conference on empirical methods in natural language processing (EMNLP), pp. 90–121. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.emnlp-main.8. https://www.aclweb.org/anthology/2020.emnlp-main.8

40.

Thompson B, Post M (2020) Paraphrase generation as zero-shot multilingual translation: disentangling semantic similarity from lexical and syntactic diversity. In: Proceedings of the Fifth Conference on Machine Translation, pp 561–570. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.wmt-1.67

41.

Fan C, Tian Y, Meng Y, Peng N, Sun X, Wu F, Li J (2021) Paraphrase generation as unsupervised machine translation. arXiv:2109.02950

42.

Siddique A, Oymak S, Hristidis V (2020) Unsupervised paraphrasing via deep reinforcement learning. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1800–1809

43.

Vijayakumar AK, Cogswell M, Selvaraju RR, Sun Q, Lee S, Crandall D, Batra D (2016) Diverse beam search: decoding diverse solutions from neural sequence models. arXiv:1610.02424

44.

Xu B (2019) NLP Chinese corpus: large scale Chinese corpus for NLP. Zenodo. https://doi.org/10.5281/zenodo.3402023

45.

Zhu Y, Lu S, Zheng L, Guo J, Zhang W, Wang J, Yu Y (2018) Texygen: a benchmarking platform for text generation models. arXiv:1802.01886

46.

Ziegler ZM, Melas-Kyriazi L, Gehrmann S, Rush AM (2019) Encoder-agnostic adaptation for conditional language generation. arXiv e-prints, 1908

47.

Fan A, Lewis M, Dauphin Y (2018) Hierarchical neural story generation. In: ACL 2018—56th annual meeting of the association for computational linguistics, proceedings of the conference, vol 1, pp 889–898. https://doi.org/10.18653/v1/p18-1082

48.

Koehn P, Knowles R (2017) Six challenges for neural machine translation. First workshop on neural machine translation, pp 28–39 arXiv:1706.03872. https://doi.org/10.18653/v1/w17-3204

49.

Ito T, Kuribayashi T, Hidaka M, Suzuki J, Inui K (2020) Langsmith: An interactive academic text revision system. In: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp 216–226. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.emnlp-demos.28. https://www.aclweb.org/anthology/2020.emnlp-demos.28

Title: Zero-shot domain paraphrase with unaligned pre-trained language models
Authors: Zheng Chen
Hu Yuan
Jiankun Ren
Publication date: 01-08-2022
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems / Issue 1/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-022-00820-8
