Introduction
- We propose a novel architecture to generate a sentence from multi-perspective lifelog images capturing the same moments in a human–robot symbiotic environment.
- We construct a new dataset composed of synchronized multi-perspective image sequences that are annotated with natural language descriptions for each sequence.
- We conduct caption generation experiments under perspective ablation settings and demonstrate that our approach achieves significant improvements on common metrics for the image captioning task.
Related work
Visual lifelogging
Image captioning
Datasets
Models
ShowTell: Vinyals et al. [18] analogized the automatic generation of image captions to “machine translation from an image to a description” and succeeded in training deep networks that generate a template-free caption from a given image. The approach extended the encoder–decoder model proposed in the machine translation field, wherein an image feature is abstracted by a pretrained CNN encoder and the sequential likelihoods of vocabulary words are predicted step by step by an RNN decoder. Model training is formulated to maximize the cumulative log-likelihood of the reference captions.
ShowAttendTell: In addition, Xu et al. [19] applied an attention mechanism, originally proposed in the machine translation field to improve the word alignments between the source and target languages. The attention mechanism is a type of dictionary model that holds a set of feature candidates from an encoder. At each decoding step, the input features are adaptively selected with hard/soft weights computed from the top-down signal of the decoder. In this manner, the model can efficiently propagate the source information to predict the sequential outputs. In the context of image captioning, the attention mechanism receives the CNN feature maps as grid features, so that it bridges specific image regions and the prediction of each word in a caption. This approach has significantly improved caption quality.
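In its soft form, the attention weight for each candidate can be written in the following standard way, where \(f_{att}\) is a small scoring network, \(\varvec{v}_{\varvec{i}}\) are the candidate features, and \(\varvec{h}_{\varvec{t}-\varvec{1}}\) is the decoder state (the notation here is ours and only summarizes the common formulation, not the exact one in [19]):

\[ \alpha _{t,i} = \frac{\exp \bigl ( f_{att}(\varvec{v}_{\varvec{i}}, \varvec{h}_{\varvec{t}-\varvec{1}}) \bigr )}{\sum _{j} \exp \bigl ( f_{att}(\varvec{v}_{\varvec{j}}, \varvec{h}_{\varvec{t}-\varvec{1}}) \bigr )}, \qquad \hat{\varvec{v}}_{\varvec{t}} = \sum _{i} \alpha _{t,i}\, \varvec{v}_{\varvec{i}}, \]

so that the attended feature \(\hat{\varvec{v}}_{\varvec{t}}\) passed to the decoder is a convex combination of the candidates.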
UpDown: More recently, Anderson et al. [20] proposed a novel approach in which the attention mechanism receives a set of region-of-interest (ROI) features as candidates, instead of the grid features used in ShowAttendTell. With this modeling, the visual concepts in both the foreground and the background of an image are encoded as object features, whereas the ShowAttendTell model may disassemble them into several grid features. In this manner, the context and the relationships of salient objects can be accurately reflected in the caption.
In contrast to the ShowTell model, which takes an image as a single global feature to be used in the caption decoder, the attention-based models ShowAttendTell and UpDown can pool multiple feature candidates, namely spatial grids or salient regions. Our key idea for multi-perspective image captioning is “attend to fuse”, that is, to organize the feature candidates across complementary multiple images acquired from a human–robot symbiotic environment. Moreover, we assume that an attention module pretrained with per-image captions on a dataset such as COCO [21] generalizes to feature candidates scattered over images taken from different perspectives.
Fourth-person vision
First-person vision
Second-person vision
Third-person vision
Fourth-person vision
Our approach
Our architecture builds on UpDown by Anderson et al. [20] and is illustrated in Fig. 3. UpDown first enumerates salient regions within a given image, encodes the spatial feature of each region into a fixed-size vector (Sect. “Image encoding”), and feeds them into the captioning process with an attention mechanism (Sect. “Caption generation”). In our multi-perspective setting, the region features are obtained from each perspective and fed into the captioning process as attention candidates for decoding words. In this study, we focus in particular on how to reorganize the attention candidates drawn from the multi-perspective images. We propose a bottom-up fusion step that clusters the salient region features to suppress repeated appearances of identical instances across multiple viewpoints (Sect. “Salient region clustering”).
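The overall flow can be sketched as follows; this is a minimal sketch with illustrative names, where detector, fuse, and decoder stand for the components detailed in the following subsections (they are not actual identifiers from our implementation):

```python
def caption_from_multi_perspective(images, detector, fuse, decoder):
    """Sketch of the proposed pipeline; all callables are illustrative.

    images:   one image per perspective (first-, second-, third-person).
    detector: returns the ROI features of an image, shape (N_i, 2048).
    fuse:     bottom-up fusion of candidates, e.g. the clustering sketched
              in "Salient region clustering".
    decoder:  an UpDown-style attention decoder over a candidate set.
    """
    # 1) Image encoding: salient region features from every perspective.
    per_view = [detector(img) for img in images]

    # 2) Bottom-up fusion: build a compact candidate set that merges
    #    repeated instances across viewpoints.
    candidates = fuse(per_view)

    # 3) Caption generation: decode words while attending over the
    #    fused candidates.
    return decoder.generate(candidates)
```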
Image encoding
We use a Faster R-CNN detector, as in UpDown [20], to detect objects or salient regions as bounding boxes assigned to the 1,600 object classes and 400 attribute classes of Visual Genome [23]. The detected raw regions are processed with non-maximum suppression to filter out overlaps, and for each selected region the mean-pooled feature of the penultimate layer of the object/attribute classifiers is extracted as \(\varvec{v}_{\varvec{i}}\). Each feature \(\varvec{v}_{\varvec{i}}\) represents high-level semantic information about a partial scene of the image.
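A minimal per-image sketch of this step is given below, using generic torchvision operators as a stand-in for the Visual Genome-pretrained Faster R-CNN of [20]; the IoU threshold, pooling size, and spatial scale are illustrative assumptions, not our exact configuration:

```python
from torchvision.ops import nms, roi_align

def encode_regions(feature_map, boxes, scores, iou_thr=0.7, max_regions=100):
    """Stand-in sketch for the per-image encoding of UpDown [20].

    feature_map: backbone output, shape (1, C, H, W), e.g. C = 2048.
    boxes:       detected boxes, shape (N, 4), in image coordinates.
    scores:      detection scores, shape (N,).
    Returns a (num_regions, C) matrix whose rows are the features v_i.
    """
    # Non-maximum suppression filters out heavily overlapping regions.
    keep = nms(boxes, scores, iou_thr)[:max_regions]

    # Pool each selected region on the feature map, then mean-pool it
    # into a single C-dimensional vector per region.
    rois = roi_align(feature_map, [boxes[keep]], output_size=7,
                     spatial_scale=1.0 / 32)
    return rois.mean(dim=(2, 3))  # (num_regions, C)
```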
Salient region clustering
A straightforward way to handle multiple perspectives is to simply bundle the ROI features from all images into one set of attention candidates (Ensemble), since the attention module can implicitly fuse correlated ROIs that respond to the same top-down signal from the decoder. However, this implicit fusion by the top-down signal may result in biased weights on repeatedly occurring objects, or may fail to jointly attend to the ROI features of an identical object because of subtle differences in the feature subspace.
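A minimal sketch of the proposed bottom-up fusion, contrasted with the Ensemble baseline, is shown below; it uses scikit-learn's k-means with k-means++ initialization as described in the experiments, with the cluster count k as a hyperparameter (32 by default):

```python
import numpy as np
from sklearn.cluster import KMeans

def ensemble_candidates(per_view_features):
    """Ensemble baseline: bundle all ROI features without clustering."""
    return np.concatenate(per_view_features, axis=0)

def kmeans_candidates(per_view_features, k=32):
    """Proposed bottom-up fusion: cluster the ROI features from all
    perspectives and return the k centroids as attention candidates, so
    that repeated appearances of an identical instance across viewpoints
    collapse into shared candidates.

    per_view_features: list of arrays, one per perspective, each (N_i, 2048).
    """
    stacked = ensemble_candidates(per_view_features)
    if len(stacked) <= k:
        # With too few regions, clustering degenerates to the Ensemble set.
        return stacked
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(stacked)
    return km.cluster_centers_  # (k, 2048)
```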
Caption generation
We follow the caption decoder of UpDown [20] to generate a sequence of words \(S=\{\varvec{s}_{\varvec{1}}, \ldots , \varvec{s}_{\varvec{T}}\}\) from a set of attention candidate vectors \({\tilde{V}}\). The words \(\varvec{s}_{\varvec{t}}\) are represented as one-hot vectors whose dimension equals the number of vocabulary words K. The decoder is composed of two stacked long short-term memories (LSTMs) and an attention module, as shown in Fig. 5. At timestep t, each LSTM updates its hidden state \(\varvec{h}_{\varvec{t}}\) given the previous hidden state \(\varvec{h}_{\varvec{t}-\varvec{1}}\) and an input vector \(\varvec{x}_{\varvec{t}}\).
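For concreteness, one decoding step of such a two-LSTM decoder can be sketched as follows; this is a PyTorch sketch in the spirit of [20], and the hidden sizes and the additive attention parameterization are illustrative rather than the exact configuration of our implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLSTMAttentionDecoder(nn.Module):
    """Sketch of one decoding step of an UpDown-style decoder [20]."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention LSTM: sees the language-LSTM state, the mean candidate
        # feature, and the previous word embedding.
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        # Additive attention over the candidate vectors V~.
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        # Language LSTM: sees the attended feature and the attention-LSTM state.
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.logit = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, V, state):
        """prev_word: (B,) word ids; V: (B, N, feat_dim) candidates."""
        (h1, c1), (h2, c2) = state
        x1 = torch.cat([h2, V.mean(dim=1), self.embed(prev_word)], dim=-1)
        h1, c1 = self.att_lstm(x1, (h1, c1))

        # Soft attention: weight each candidate by the top-down signal h1.
        scores = self.att_out(torch.tanh(self.att_v(V) + self.att_h(h1).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)   # (B, N, 1)
        v_hat = (alpha * V).sum(dim=1)     # attended feature, (B, feat_dim)

        x2 = torch.cat([v_hat, h1], dim=-1)
        h2, c2 = self.lang_lstm(x2, (h2, c2))
        word_logits = self.logit(h2)       # unnormalized distribution over vocab
        return word_logits, ((h1, c1), (h2, c2))
```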
Experiments
Dataset
Evaluation metrics
Implementation details
UpDown is trained on pairs of an image and five captions in the training split. The captions contain no punctuation and are unified in lower case. The vocabulary is pruned by mapping any word that occurs fewer than five times to a special \(\texttt {<unknown>}\) token. The final vocabulary comprises 10,010 words. We use our PyTorch re-implementation of the UpDown model originally written in Caffe [20]. For the Faster R-CNN detector, we choose ResNet-101 [32] as the backbone and perform ROI pooling on the pool5 feature maps so as to encode each region into a 2048-D vector. For each image, we select up to 100 candidate regions according to the predicted scores. Since the number of candidate regions is small, the clustering step had little effect on the overall processing time in our experiments. Following the baseline [20], the model is trained by minimizing the cross-entropy of the reference captions, followed by Self-Critical Sequence Training (SCST) [33], which directly optimizes the CIDEr scores of sampled captions. We perform beam search decoding with a beam width of 5 until the end token is reached or a maximum length of 20 words. Except for stop words, each word is restricted to occur at most once in a caption.
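The vocabulary pruning rule can be sketched as follows; the \(\texttt {<unknown>}\) token and the count threshold of five follow the description above, while the tokenization itself is simplified:

```python
from collections import Counter

UNKNOWN = "<unknown>"

def build_vocabulary(captions, min_count=5):
    """Keep words that occur at least `min_count` times; everything else is
    mapped to the <unknown> token.

    captions: iterable of lower-cased, punctuation-free caption strings.
    """
    counts = Counter(word for cap in captions for word in cap.split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return [UNKNOWN] + kept  # 10,010 words on our training split

def encode_caption(caption, word_to_id):
    unk = word_to_id[UNKNOWN]
    return [word_to_id.get(w, unk) for w in caption.split()]
```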
Quantitative analysis
Perspective ablation
In the following, the numeric suffix denotes the input perspectives; for example, Ensemble-123 denotes the ensemble of features from the first-, second-, and third-person perspective images. All generated captions are individually evaluated against the reference captions. UpDown [20] is the baseline method, which takes a single-perspective image as input. Ensemble is the method that bundles the attention candidates from two or three perspectives and feeds them to the UpDown decoder. KMeans is our proposed method, which constructs the attention candidates by first clustering the multi-perspective ROI features into k groups and then feeding the k centroids to the UpDown decoder. We initially set the number of clusters k to 32, which is close to the best number of attention candidates reported in the UpDown results on the COCO dataset [20]. The centroids are initialized with the k-means++ [34] algorithm and iteratively updated until convergence.

As seen in Table 1, KMeans-123, the proposed method taking images from all three perspectives as input, shows the best scores on all evaluation metrics. Focusing on the double-input models (middle rows), the score of the proposed KMeans is higher than that of Ensemble for every input combination, indicating the effectiveness of the bottom-up clustering. We note that the SPICE scores of UpDown-1 and UpDown-2 are close; however, even Ensemble-12, which simply combines the first- and second-person images, boosts the performance. It can be considered that each perspective carries complementary visual cues for generating actor-related descriptions. This effect is most pronounced for the third-person perspective, even though the third-person image alone makes it difficult to exclusively extract the actor's features. Another important observation is that Ensemble-12 is better than Ensemble-123, whereas KMeans-123 outperforms KMeans-12 thanks to the bottom-up fusion clustering.

Input perspective(s) | Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | CIDEr-D | SPICE
---|---|---|---|---|---|---|---|---|---
First | UpDown [20] | 51.20 | 33.47 | 20.41 | 11.25 | 38.85 | 17.45 | 21.44 | 12.19
Second | UpDown [20] | 60.86 | 43.24 | 31.12 | 21.19 | 45.60 | 19.46 | 16.94 | 12.08
Third | UpDown [20] | 42.80 | 26.56 | 16.17 | 9.70 | 31.34 | 13.73 | 6.79 | 6.28
Second + Third | Ensemble | 59.14 | 41.97 | 30.45 | 21.06 | 44.09 | 19.13 | 15.18 | 11.40
Second + Third | KMeans | 62.31 | 45.34 | 33.16 | 22.91 | 46.22 | 20.19 | 17.76 | 12.21
First + Third | Ensemble | 59.06 | 42.78 | 30.47 | 20.28 | 45.16 | 20.33 | 27.71 | 14.37
First + Third | KMeans | 60.83 | 44.71 | 32.03 | 21.48 | 46.27 | 21.16 | 30.10 | 15.02
First + Second | Ensemble | 62.08 | 45.37 | 32.82 | 22.47 | 47.67 | 21.68 | 30.03 | 15.04
First + Second | KMeans | 62.43 | 45.78 | 32.90 | 22.19 | 47.61 | 21.87 | 30.76 | 15.24
First + Second + Third | Ensemble | 63.12 | 46.37 | 34.08 | 23.71 | 47.92 | 21.72 | 29.52 | 14.99
First + Second + Third | KMeans | 65.09 | 48.93 | 36.02 | 24.78 | 49.13 | 22.79 | 33.41 | 15.72
We also compute the relative improvement obtained by adding each perspective, comparing KMeans-123 with the double-input model that excludes that perspective (e.g., the first-person row compares KMeans-123 and KMeans-23). For the precision-based BLEU metrics, the second-person perspective shows the highest gains. Notably, for the other metrics, the first-person perspective shows the highest gains. It can be considered that the second-person images explicitly and exclusively capture the actor's scenes to generate an actor-wise description, but other important visual cues reside in the other perspective images.
Relative improvement (%) of KMeans-123 over the double-input KMeans model that excludes the listed perspective (e.g., First-person: KMeans-123 vs. KMeans-23)

Added perspective | Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L | METEOR | CIDEr-D | SPICE
---|---|---|---|---|---|---|---|---|---
First-person | KMeans | +4.5 | +7.9 | +8.6 | +8.1 | +6.3 | +12.9 | +88.1 | +28.7
Second-person | KMeans | +7.0 | +9.4 | +12.4 | +15.4 | +6.2 | +7.7 | +11.0 | +4.7
Third-person | KMeans | +4.2 | +6.9 | +9.5 | +11.7 | +3.2 | +4.2 | +8.6 | +3.1
Detailed results of SPICE
In the per-category SPICE breakdown, KMeans-123 still outperforms the other models in the object category; in the other categories, single- or double-input models are better. The first-person model is remarkably superior to the others in the color category. This indicates that the joint attention over ROI features across multiple perspectives can obscure detailed visual concepts, while still stably generating captions that include important concepts such as the actor and other items.

Input perspective(s) | Method | SPICE (All) | Object | Relation | Attribute | Color | Count | Size
---|---|---|---|---|---|---|---|---
First | UpDown [20] | 12.19 | 26.36 | 1.42 | 3.52 | 2.38 | 0.00 | 0.00
Second | UpDown [20] | 12.08 | 26.66 | 1.30 | 1.45 | 0.02 | 0.00 | 0.00
Third | UpDown [20] | 6.28 | 14.62 | 0.46 | 0.17 | 0.04 | 0.11 | 0.00
Second + Third | Ensemble | 11.40 | 25.42 | 1.13 | 0.87 | 0.00 | 0.00 | 0.00
Second + Third | KMeans | 12.21 | 27.40 | 1.08 | 0.86 | 0.00 | 0.00 | 0.00
First + Third | Ensemble | 14.37 | 30.48 | 2.15 | 3.30 | 0.17 | 0.00 | 0.00
First + Third | KMeans | 15.02 | 32.02 | 2.13 | 3.16 | 0.15 | 0.00 | 0.00
First + Second | Ensemble | 15.04 | 31.96 | 1.98 | 3.63 | 0.04 | 0.00 | 0.00
First + Second | KMeans | 15.24 | 32.56 | 1.90 | 3.41 | 0.14 | 0.00 | 0.00
First + Second + Third | Ensemble | 14.99 | 32.02 | 2.01 | 3.35 | 0.02 | 0.00 | 0.00
First + Second + Third | KMeans | 15.72 | 33.74 | 1.96 | 3.18 | 0.05 | 0.00 | 0.00
Clustering setting
We further examine the clustering configuration of KMeans-123. For clustering algorithms, we compare basic k-means [25], which produces k centroids; x-means [35], which adaptively subdivides the clusters under the Bayesian information criterion; and k-medoids [36], which updates medoids instead of centroids. All algorithms are initialized with the k-means++ [34] algorithm. For the number of clusters, we sweep \(\{4, 8, 16, 32, 64\}\) for each algorithm. We note that only x-means may increase this number during updates, and all algorithms become equivalent to Ensemble-123 when the number of clusters reaches the number of attention candidates. Fig. 9 shows the CIDEr-D and SPICE scores for the various clustering algorithms and for Ensemble-123 without clustering. For both metrics, the peaks are at \(k=32\) for the k-means and x-means algorithms, while the k-medoids algorithm, which “prunes” the attention candidates, reduces both scores as the number of clusters decreases. Although the k-medoids algorithm is known to be robust to noise and outliers, each selected medoid is backed by only a single salient region, whereas the other algorithms use centroids that average the features within a cluster. We observe that such roughly grouped centroids still preserve the visual features, which encourages implicit joint attention across perspectives and lets it be reflected in the captions.
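The sweep over the number of clusters, including the degenerate case equivalent to Ensemble-123, can be sketched as follows (k-means only; the x-means and k-medoids variants would substitute their own clustering routines):

```python
import numpy as np
from sklearn.cluster import KMeans

def sweep_num_clusters(per_view_features, ks=(4, 8, 16, 32, 64)):
    """Build attention candidates for each cluster count k (k-means case)."""
    stacked = np.concatenate(per_view_features, axis=0)
    candidates = {}
    for k in ks:
        if k >= len(stacked):
            # Degenerates to Ensemble-123: every ROI feature is a candidate.
            candidates[k] = stacked
        else:
            km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(stacked)
            candidates[k] = km.cluster_centers_
    return candidates
```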
Temporal batch clustering
We also evaluate KMeans-123 and Ensemble-123 in a setting where the attention candidates are aggregated across consecutive frames, not only across perspectives. For the number of pooled frames, we sweep \(\{1, 2, 4, 8, 16, 32\}\) for both approaches. As seen in Fig. 10, both approaches boost the performance as the number of frames increases.
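The temporal extension simply widens the pool of ROI features before the candidates are built, as sketched below; the window length and the cluster count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def temporal_candidates(frame_features, num_frames=8, k=32):
    """Pool ROI features over a window of consecutive frames (all
    perspectives included) before building the attention candidates.

    frame_features: list over frames; each entry is a list of per-perspective
                    ROI feature arrays of shape (N_i, 2048).
    """
    window = frame_features[:num_frames]
    pooled = np.concatenate([f for views in window for f in views], axis=0)
    if len(pooled) <= k:
        return pooled  # Ensemble-style candidate set
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(pooled)
    return km.cluster_centers_
```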
Qualitative analysis
Generated captions
We compare captions generated by the first-person (UpDown-1), second-person (UpDown-2), and third-person (UpDown-3) models and by our fourth-person model (KMeans-123). The bottom two examples show interactive scenes where the participants are handing over household items. The first-person perspective captions tend to briefly describe the relationship between the actor and the objects manipulated by the hands, whereas there is no explicit phrase about context such as the type of place. In contrast, the second-person perspective captions succeed in describing where the actor is and his/her posture, which are concepts invisible from the first-person perspective. However, in some cases the type of activity remains unclear because of its visual granularity. The third-person perspective captions are more ambiguous about the participants' situation, although they describe novel objects that are not visible from the other perspectives. Finally, our proposed approach generates more detailed captions about the actor and the context. For instance, in the top-left figure, the caption by our method includes the actor's posture, location, and detailed activity, which could only be partially described in the single-perspective cases. Moreover, in the interactive cases at the bottom, our proposed method improves on the third-person captions with an additional phrase about the manipulated objects derived from the first- and/or second-person image sources. Qualitatively, the critical visual concepts mostly reside in the first- and second-person perspectives, while the third-person perspective contributes to describing the interactive scenes. Although the verbal expression differs slightly for each perspective, our method successfully produces reasonable descriptions that semantically integrate the three types of perspectives. It can be considered that our clustering in the ROI feature space works effectively to summarize the multi-perspective images.
Visualizing ROI attention
We visualize the ROI attention weights of Ensemble-123 and KMeans-123. We can see remarkable differences in predicting human-related words. The Ensemble-123 model incorrectly attends to different people, or to only one of them, while our KMeans-123 model successfully focuses on the same person in the second- and third-person images. This indicates that our proposed bottom-up approach effectively improves the instance correspondence in the top-down weighting.