Abstract

Language and vision are the two most essential parts of human intelligence for interpreting the real world around us. How to make connections between language and vision is a key point in current research. Multimodal methods such as visual semantic embedding, which unify images and their corresponding texts into the same feature space, have been widely studied recently. Inspired by recent progress in text data augmentation and by a simple but powerful technique called EDA (easy data augmentation), we can expand the information contained in the given data with EDA to improve model performance. In this paper, we exploit text data augmentation and word embedding initialization for multimodal retrieval. We utilize EDA for text data augmentation, initialize the recurrent-neural-network-based text encoder with pretrained word embeddings, and minimize the gap between the two spaces with a triplet ranking loss with hard negative mining. On the two Flickr-based datasets, we match the recall of normal training on the full available data while using only 60% of the training set. Experimental results show the improvement of our proposed model: on all datasets used in this paper (Flickr8k, Flickr30k, and MS-COCO), our model performs better on image annotation and image retrieval tasks. The experiments also demonstrate that text data augmentation is more suitable for smaller datasets, while word embedding initialization is more suitable for larger ones.

1. Introduction

Language and vision are the two most essential parts of human intelligence for interpreting the real world around us and communicating with each other. Intuitively, uniting these two systems is important for research on both human intelligence and artificial intelligence. With the rapid development of machine learning (ML), especially deep learning (DL) [1], breakthroughs have been achieved both in separate language and vision processing and at their union. However, at the union level, how to compare language and vision in a unified way remains a problem. Visual semantic embedding (VSE) has been proposed to tackle this problem. In visual semantic embedding research, datasets usually provide an image together with a corresponding description, which may be a single word, a phrase, or a sentence. This allows us to unify the image representation [2] and the word representation/embedding [3] into the same feature space. Visual semantic embedding learns a representation that maps semantically associated image-text pairs into the same space; that is, it learns a common feature space that reflects the underlying domain structure, in which the embeddings of images and texts are semantically meaningful. This allows us to compare given images and texts in a unified way and to perform multimodal retrieval. In real-world retrieval tasks, however, the labelled training set is always small compared with the whole data in the system, and it grows much more slowly than the system itself in this era of booming information. How to use the limited training data to obtain a robust and efficient model thus becomes the central challenge in VSE tasks.

In natural language processing (NLP), we face the same problem of limited training data, and higher-level tasks such as sentiment classification [4, 5], stance detection [6, 7], and answer generation [8] all depend heavily on the size and quality of the training data to reach reasonable performance. To tackle this problem, many approaches have been proposed to extract more information from the given training data or to introduce pretrained common-sense models. One popular approach generates extra data by translating sentences into French and then back to English [9]; that is, the original sentence is translated into another language and back into the original language using a pair of machine translation models. Other works have used predictive language models to replace synonyms in the data to expand the original data [10] and applied data noising to smooth the expanded data [11]. Previous data augmentation methods were often time-consuming [9–11]. We use the recently proposed method called EDA (easy data augmentation) [12] and the classic word embedding model Word2Vec [3] to tackle this problem.

The main contributions are summarized as follows:
(i) We introduce text data augmentation for visual semantic embedding methods, which helps the model obtain more information from the training data and achieve better performance on image annotation and image retrieval tasks. Experimental results also show that, with text data augmentation, the model can achieve the same performance with less training data; that is, the proposed method requires less training data, which is exactly what we are trying to achieve.
(ii) We introduce pretrained word embeddings to initialize the weights of the text encoder. When the whole model is trained jointly, the pretrained word embeddings continually improve the text representation of the given descriptions and reduce the training time of the whole model.

2. Related Work

2.1. Visual Semantic Embedding

Traditional image annotation methods use only the features provided by the image itself; intuitively, such methods can be widely applied because of their low data requirements, but this also limits their performance. To expand the features available to these methods, visual semantic embedding has been proposed. Visual semantic embedding embeds the features of images and texts into the same space and, with the help of this embedding, achieves better performance on image annotation tasks. Frome et al. proposed DeViSE (deep visual-semantic embedding) [13] to perform zero-shot image classification; it uses Word2Vec [3] to represent the label words of the given image. Karpathy et al. proposed DeFrag (deep fragment embedding) [14], which uses an R-CNN (region-convolutional neural network) [15] model to extract image features. Karpathy et al. also proposed VSA (deep visual-semantic alignments) [16], which combines an R-CNN [15] and a BRNN (bidirectional recurrent neural network) [17] to extract image features and text features, respectively. Kiros et al. proposed UVSE (unifying visual-semantic embedding) [18], which uses a VGG-19 [19] network to extract image features and an LSTM (long short-term memory) [20] network to extract text features of the corresponding description. Faghri et al. proposed an extension of UVSE called VSE++ (visual-semantic embedding++) [21], which adds hard negative mining to the original UVSE. Other methods use a GRU (gated recurrent unit) [22], TextCNN [23], or other architectures as the text extractor to boost performance on given tasks, or introduce multimodal hashing methods to support efficient multimedia retrieval [24, 25]. Our learning framework is an extension of VSE++ with text data augmentation and a triplet ranking loss with hard negative mining.

2.2. Text Data Augmentation

Previous works have proposed many techniques for data augmentation in natural language processing (NLP). One popular approach generates extra data by translating sentences into French and then back to English [9]; that is, the original sentence is translated into another language and back into the original language using a pair of machine translation models. Other works have used predictive language models to replace synonyms in the data to expand the original data [10] and applied data noising to smooth the expanded data [11]. Although these techniques are useful, they are not widely used in practice because they require considerable computational resources to reach reasonable performance. In this work, we utilize the simple yet powerful method called easy data augmentation (EDA) [12], which expands the information contained in the given data and has proved effective in text classification tasks.

2.3. Hard Negative Mining

Hard negatives are negative samples that are wrongly classified as positive and therefore incur the highest loss during training. To mitigate this problem, hard negative mining (HNM) has been proposed. Hard negative mining works as follows: all samples are first classified by a classifier, the misclassified hard negative samples are added to the negative sample set, and the classifier is then trained further with the updated negative sample set. HNM is a well-known procedure in the context of sliding-window detectors [26] and in object detection and semantic segmentation [15]. It is also well studied in computer vision tasks such as human detection [27] and face recognition [28, 29]. Our work is an extension of VSE++ [21]; the main contribution is that we improve performance through text data augmentation and a triplet ranking loss with hard negative mining, which are discussed in the sections titled “Text Data Augmentation” and “Triplet Ranking Loss with Hard Negative Mining.”

2.4. Word Embedding

Word embedding is a fundamental topic in NLP. Since computers cannot directly process natural language, word embeddings transform natural language into values that computers can process. The most basic word representation is the one-hot representation, but it carries no information about a word's context. To address this problem, many context-based word embedding methods have been proposed. One of the most influential and powerful is Word2Vec [3]. After Word2Vec, GloVe [30] was proposed to improve representation quality by combining global word statistics with context information. In 2017, Facebook released an efficient method called FastText [31], which brought production-level word embedding to academia; FastText achieves good vector quality with relatively fast training. Our model uses the original Word2Vec to initialize the embedding layer of the text encoder.

3. Learning Framework

3.1. Dual-Normalized Visual Semantic Embedding Learning

In this section, we propose a dual-normalized visual semantic embedding framework using deep neural networks, as shown in Figure 1.

Our proposed framework consists of two parts: one extracts features from the input image, and the other extracts features from the corresponding text; the network is then trained with a triplet ranking loss. As shown in Figure 1, the left part is the image processing branch: a CNN-based network (in our case, a VGG-19 architecture that extracts the image feature), followed by a fully connected layer and a normalization layer. The right part is the text processing branch: a typical RNN-based text representation framework (in our case, a GRU that extracts the text feature) with data augmentation; that is, it contains a text augmentation layer, a word embedding layer, an RNN layer, and a normalization layer. After extracting features from the image and the text, we use the similarity (the inner product of the two feature vectors) of a given image and text as the paired score and train the network with the triplet ranking loss with hard negative mining (detailed in the section titled “Triplet Ranking Loss with Hard Negative Mining”). It is worth pointing out that we train on image-text pairs, so text data augmentation efficiently increases the number of training samples.
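To make the dual-branch design concrete, the following PyTorch-style sketch shows one possible implementation under our own assumptions (module and parameter names are ours, not the authors' released code): pre-extracted 4096-dimensional VGG-19 features and GRU sentence encodings are each projected and L2-normalized into a shared 1024-dimensional space, and their inner product serves as the similarity score.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Project pre-extracted VGG-19 features (4096-d) into the joint space."""
    def __init__(self, img_dim=4096, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(img_dim, embed_dim)

    def forward(self, img_feats):                 # (batch, 4096)
        x = self.fc(img_feats)
        return F.normalize(x, p=2, dim=1)         # L2-normalize into the joint space

class TextEncoder(nn.Module):
    """Embed word ids, run a GRU, and project the final hidden state into the joint space."""
    def __init__(self, vocab_size, word_dim=300, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, captions):                  # (batch, seq_len) word ids
        x = self.embed(captions)
        _, h = self.gru(x)                        # h: (1, batch, embed_dim)
        return F.normalize(h.squeeze(0), p=2, dim=1)

def similarity(img_emb, txt_emb):
    """Inner product of normalized embeddings, i.e., a cosine similarity matrix."""
    return img_emb @ txt_emb.t()                  # (n_images, n_captions)
```

Because both embeddings are L2-normalized, the inner product above coincides with the cosine similarity used later in the model settings.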

3.2. Text Data Augmentation

Consistent with EDA [12], we tested several augmentation operations analogous to those widely used in computer vision tasks; the experiments show that adding data augmentation lets us train more robust models. The details of the text augmentation are discussed in this part. Table 1 summarizes the notation used in this paper.

For a given sentence in the training set with length l, we apply the following augmentation operations:
(i) Synonym Replacement (SR). Randomly choose n words from the sentence which are not stop words. Replace each of these words with one of its synonyms chosen at random from WordNet [31].
(ii) Random Insertion (RI). Find a random synonym of a random word in the sentence which is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
(iii) Random Swap (RS). Randomly choose two words in the sentence and swap their positions. Do this n times.
(iv) Random Deletion (RD). Randomly remove each word in the sentence with probability p.

In the above operations, we set

n = α · l,  (1)

where α denotes the percentage of words to be changed for SR, RI, and RS, based on the sentence length l.

For the RD operation, for simplicity, we set

p = α.  (2)
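The sketch below illustrates the four operations in plain Python under our own assumptions: sentences are simple word lists, and the stop-word set and the get_synonyms helper (e.g., backed by WordNet) are placeholders supplied by the caller; this is not the reference implementation of [12].

```python
import random

def synonym_replacement(words, n, stop_words, get_synonyms):
    """SR: replace up to n non-stop-words with a randomly chosen synonym each."""
    new_words = words[:]
    candidates = [w for w in set(words) if w not in stop_words]
    random.shuffle(candidates)
    replaced = 0
    for w in candidates:
        synonyms = get_synonyms(w)
        if synonyms:
            chosen = random.choice(synonyms)
            new_words = [chosen if x == w else x for x in new_words]
            replaced += 1
        if replaced >= n:
            break
    return new_words

def random_insertion(words, n, stop_words, get_synonyms):
    """RI: insert a synonym of a random non-stop-word at a random position, n times."""
    new_words = words[:]
    for _ in range(n):
        candidates = [w for w in new_words if w not in stop_words]
        if not candidates:
            break
        synonyms = get_synonyms(random.choice(candidates))
        if synonyms:
            new_words.insert(random.randrange(len(new_words) + 1), random.choice(synonyms))
    return new_words

def random_swap(words, n):
    """RS: swap the positions of two randomly chosen words, n times."""
    new_words = words[:]
    for _ in range(n):
        i, j = random.randrange(len(new_words)), random.randrange(len(new_words))
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    """RD: remove each word independently with probability p; keep at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def augment(sentence, alpha=0.1, stop_words=frozenset(), get_synonyms=lambda w: []):
    """Apply one round of each operation with n = alpha * l (eq. (1)) and p = alpha (eq. (2))."""
    words = sentence.split()
    n = max(1, int(alpha * len(words)))
    return [
        " ".join(synonym_replacement(words, n, stop_words, get_synonyms)),
        " ".join(random_insertion(words, n, stop_words, get_synonyms)),
        " ".join(random_swap(words, n)),
        " ".join(random_deletion(words, alpha)),
    ]
```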

3.3. Triplet Ranking Loss with Hard Negative Mining

Triplet loss [32] is a product of deep metric learning [33]; it takes a triplet as input and pulls the anchor closer to the positive sample while pushing it away from the negative sample.

In deep metric learning, the triplet loss is represented as

L(x_a, x_p, x_n) = [m + S(x_a, x_n) − S(x_a, x_p)]+,  (3)

where [x]+ = max(0, x), S(x_a, x_p) is the similarity of the anchor x_a and the positive input x_p, S(x_a, x_n) is the similarity of the anchor x_a and the negative input x_n, and m is the margin that keeps negative pairs apart from the anchor.

Therefore, we define our triplet ranking loss as

ℓ(i, t) = Σ_t̂ [m + S(i, t̂) − S(i, t)]+ + Σ_î [m + S(î, t) − S(i, t)]+,  (4)

where i and t are a paired image and text; that is, the given data contain image i with text description t. Meanwhile, î and t̂ denote nonpaired (negative) images and texts.

To emphasise hard negative mining, we formulate our final loss as

ℓ_MH(i, t) = max_t̂ [m + S(i, t̂) − S(i, t)]+ + max_î [m + S(î, t) − S(i, t)]+,  (5)

where the max operators keep only the hardest negative sample, i.e., the single negative with the largest loss, as the loss of the model.
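As a concrete reading of equation (5), the following PyTorch-style sketch (our assumed implementation in the spirit of VSE++ [21], not the authors' code) computes the max-of-hinges loss over a minibatch in which the image and caption at the same index form the positive pair and all other entries act as negatives.

```python
import torch

def hard_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Equation (5): max-of-hinges triplet ranking loss over a batch.

    img_emb, txt_emb: L2-normalized (batch, dim) tensors where row k of each
    is a matching image-text pair; all other rows act as negatives.
    """
    scores = img_emb @ txt_emb.t()                     # pairwise similarities S
    diagonal = scores.diag().view(-1, 1)               # S(i, t) for the true pairs

    # hinge for caption negatives: m + S(i, t_hat) - S(i, t)
    cost_caption = (margin + scores - diagonal).clamp(min=0)
    # hinge for image negatives: m + S(i_hat, t) - S(i, t)
    cost_image = (margin + scores - diagonal.t()).clamp(min=0)

    # mask out the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_image = cost_image.masked_fill(mask, 0)

    # keep only the hardest negative per anchor (max over each row/column)
    return cost_caption.max(dim=1)[0].mean() + cost_image.max(dim=0)[0].mean()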

4. Experiments

Following the experimental settings of VSE++ [21], we test an image annotation task and an image retrieval task. In the image annotation task, the input is an image, and the model must find the correct text description for that image; in the image retrieval task, the input is a description text, and the model must find the image that best matches it. To evaluate the results, we use Recall@K (R@K, higher is better) and median rank (Med r, lower is better) as evaluation metrics.
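A small sketch of how these metrics can be computed from a query-gallery similarity matrix is given below; it reflects our own simplified assumption of one ground-truth gallery item per query, whereas the actual test sets pair each image with five captions.

```python
import numpy as np

def recall_at_k_and_med_r(scores, ks=(1, 5, 10)):
    """scores[i, j]: similarity between query i and gallery item j, where gallery
    item i is assumed to be the ground truth for query i."""
    order = np.argsort(-scores, axis=1)                          # best matches first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) for i in range(scores.shape[0])])
    recalls = {k: 100.0 * float(np.mean(ranks < k)) for k in ks}  # Recall@K in percent
    med_r = float(np.median(ranks + 1))                          # convert 0-based ranks to 1-based
    return recalls, med_r
```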

4.1. Dataset

To evaluate the performance of the proposed framework, we test the methods on Flickr8k [34], Flickr30k [35], and MS-COCO [36]. The two Flickr-based datasets contain 8,000 and 31,000 images, respectively; on each, we use 1,000 images for validation, 1,000 images for testing, and the remaining images for training (6,000 on Flickr8k). The MS-COCO dataset contains 329,000 images, of which we use 5,000 images for validation, 5,000 images for testing, and the rest for training.

The dataset split method is the same as the method mentioned in [16]; the details can be seen in Table 2.

4.2. Model Settings

We set the batch size to 128 on all datasets, use Adam [37] as the optimizer, and set the initial learning rate to 2e-4; to better control training, we decay the learning rate to 10% of its value every 10 epochs (30 epochs in total). For the image feature extractor, we use a VGG-19 architecture that produces 4096-dimensional image features; for the text feature extractor, we use a GRU over 300-dimensional word embeddings. To be precise, we compare performance with and without text data augmentation, as well as with and without initializing the embedding layer with Word2Vec. We implement the loss function of our learning framework as the triplet ranking loss with hard negative mining and fix the margin m in equation (5) at 0.2. We set the dimension of the unified feature space to 1024 and use cosine similarity to measure the distance between image and text pairs. For the text data augmentation part, we follow the recommended parameters of [12] and set n to 4 and p to 0.1 in equations (1) and (2), respectively.
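Wired together, these settings might look like the PyTorch-style sketch below; it reuses the encoder and loss sketches from the earlier sections and assumes a hypothetical train_loader yielding batches of 128 image-caption pairs, so it illustrates the configuration rather than reproducing the authors' training script.

```python
import torch

# Assumed objects from the earlier sketches: image_encoder, text_encoder,
# hard_negative_triplet_loss; train_loader is a hypothetical DataLoader of
# (vgg_features, caption_ids) batches of size 128.
params = list(image_encoder.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.Adam(params, lr=2e-4)
# decay the learning rate to 10% of its value every 10 epochs, 30 epochs in total
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    for vgg_features, caption_ids in train_loader:
        img_emb = image_encoder(vgg_features)      # 4096-d VGG-19 features -> 1024-d space
        txt_emb = text_encoder(caption_ids)        # GRU over 300-d embeddings -> 1024-d space
        loss = hard_negative_triplet_loss(img_emb, txt_emb, margin=0.2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```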

4.3. Performance with Text Data Augmentation

We perform experiments on the Flickr8k, Flickr30k, and MS-COCO datasets. The results are reported in Tables 3–5, respectively.

4.4. Experimental Results on Flickr8k Dataset

Table 3 shows the experimental results with and without text data augmentation on the Flickr8k dataset. “Aug” in the second column indicates the result with text data augmentation. We use our own implementation of VSE++ [21] as the baseline and compare the results with the models mentioned in the section titled “Dual-Normalized Visual Semantic Embedding Learning.” From Table 3, we obtain a 28.2% improvement on the image annotation task and 20.8% on the image retrieval task over the baseline model in the Recall@1 metric. We also obtain a lower median rank on both tasks. This shows the improvement achieved by our proposed learning framework.

4.5. Experimental Results on Flickr30k Dataset

Table 4 shows the experimental results on the Flickr30k dataset. We obtain a 5.5% improvement on the image annotation task and 5.4% on the image retrieval task in the Recall@1 metric. We notice that the results of VSE++ on the Flickr30k dataset (Table 4) are better than those of UVSE (VGG), while they are slightly worse than its results on the Flickr8k dataset (Table 3). Considering the scale of the two Flickr-based datasets (8,000 versus 31,000 images), we can infer that the VSE++ model overfits on the Flickr8k dataset. This suggests that text data augmentation helps alleviate the overfitting problem on small datasets (the larger the dataset, the less likely it is to overfit). Furthermore, comparing the improvements on the two datasets, our model gains more on Flickr8k, the smaller one, which demonstrates that text data augmentation is more suitable for smaller datasets.

4.6. Experimental Results on MS-COCO Dataset

Table 5 shows the experimental results on the MS-COCO dataset. The result rows for VSA and VSE++ are taken directly from their published papers [16, 21]. Compared with the VSA model, our model achieves 17.4% and 23.4% improvements in the image annotation and image retrieval tasks, respectively, in the Recall@1 metric. Except for the median rank metric (Med r), our model exceeds the VSA model in almost all metrics; this may indicate that the VSA model applies special training tricks, or the reported number may simply be erroneous. Compared with the VSE++ model, we obtain only a 4.1% improvement in the image annotation task (Recall@1), and the result on the image retrieval task is the same as that of VSE++. This shows that the improvement from text data augmentation is limited on large datasets such as MS-COCO, because their text data are already rich enough. As described in the section titled “Performance with Word Embedding Initialization,” we can instead use word embedding initialization to further improve the model.

4.6.1. Performance on Different Training Set Sizes

To examine text data augmentation in more detail, we train on the full dataset and on the following training set fractions (%): {10, 20, 30, 40, 50, 60, 70, 80, 90}. Figure 2 shows the performance on the two Flickr-based datasets. The x-axis shows the percentage of the whole training set used during training, and the y-axis shows the sum of the two Recall@1 values on the image annotation and image retrieval tasks. The solid blue line and the dotted green line give the results without and with text data augmentation, respectively. The best sum of Recall@1 without text data augmentation, 28.3% on Flickr8k and 49.3% on Flickr30k, is achieved using 100% of the training data; with text data augmentation, we surpass those two numbers using only about 60% of the available training data. We can also infer that text data augmentation is especially helpful on smaller datasets, since the gain with text data augmentation on Flickr8k is much larger than that on Flickr30k.

4.6.2. Performance with Word Embedding Initialization

Table 6 shows the performance of our learning framework with word embedding initialization. “Aug” and “Word2Vec” in the second column denote, respectively, the model with text data augmentation and the model whose GRU text feature extractor has its embedding layer initialized with Word2Vec. We gain average improvements of 3.5% and 8.9% in the Recall@1 metric on the Flickr8k and Flickr30k datasets, respectively. Comparing the numbers on the two datasets, we may infer that word embedding initialization is more suitable for larger datasets.
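One plausible way to perform this initialization, sketched below under our own assumptions (it reuses the TextEncoder from the earlier sketch, and the vocabulary mapping and vector file path are placeholders), is to copy pretrained 300-dimensional Word2Vec vectors into the embedding layer before joint training; words missing from the pretrained model keep their random initialization.

```python
import numpy as np
import torch
from gensim.models import KeyedVectors

def init_embedding_from_word2vec(text_encoder, vocab, w2v_path):
    """Copy pretrained Word2Vec vectors into text_encoder.embed for every
    vocabulary word covered by the pretrained model."""
    w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
    weight = text_encoder.embed.weight.data            # (vocab_size, 300)
    for word, idx in vocab.items():                    # vocab: word -> row index
        if word in w2v:
            weight[idx] = torch.from_numpy(np.asarray(w2v[word], dtype=np.float32))
    return text_encoder
```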

4.6.3. Examples of Easy Data Augmentation

Figure 3 shows easy data augmentation applied to a randomly selected training sample. These examples illustrate that easy data augmentation can expand the text information and provide more diverse samples.

5. Conclusion

In this paper, we introduce text data augmentation and word embedding initialization into a visual semantic embedding learning framework based on recurrent neural networks, which unifies the representations of images and texts into the same feature space. On the image side, we apply widely used and effective convolutional neural networks. On the text side, we apply recurrent neural networks, which are good at processing sequential data, and use pretrained word embeddings to initialize the text feature extractor. We compare the performance of the model with and without text data augmentation. For the loss function, we choose the triplet ranking loss with hard negative mining. Compared with other models on the Flickr8k, Flickr30k, and MS-COCO datasets, the experiments demonstrate that our proposed visual semantic embedding learning framework performs better on tasks such as image annotation and image retrieval. We also analyse how the fraction of the training set used for learning and the word embedding initialization affect the model. These experiments further show that text data augmentation is more suitable for smaller datasets, while word embedding initialization is more suitable for larger ones.

Data Availability

The experimental datasets used in this work are publicly available, and the bundled data and code of this work are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the development and application of "IOT +" maker general prototype platform project of science and technology research program of Chongqing Education Commission of China (No. KJQN201803310), the intelligent detection and location system of noncooperative targets based on hyperspectral video images project of science and technology research program of Chongqing Education Commission of China (No. KJQN201803308), and project of Research Innovation Team of Chongqing City Management College (No. KYTD202006).