Weakly-paired deep dictionary learning for cross-modal retrieval
Introduction
The cross-modal retrieval task aims to search for relevant samples of one modality given a query from another modality, and has received extensive attention from the Internet and robotics communities, among others [4], [14]. The main challenge of cross-modal retrieval lies in the heterogeneous gap [16]: since data of different modalities reside in different feature spaces, traditional uni-modal approaches cannot be applied directly. Hence, capturing and correlating heterogeneous features from different modalities is the key to tackling the cross-modal learning problem [5]. Over the past decades, many methods have been proposed to learn a common representation of multi-modal data, which provides the basis for similarity search across modalities. To name a few, the Canonical Correlation Analysis (CCA) method and its extensions learn two projection directions, one per modality, along which the data are maximally correlated. Wang et al. [29] presented a comprehensive survey of cross-modal retrieval. Recently, it has been widely recognized that incorporating class labels or group structure when learning the low-dimensional subspaces is beneficial for cross-modal retrieval tasks; see [11], [22] and [33] for examples.
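As a concrete illustration of the CCA baseline mentioned above, the following sketch computes the canonical directions for two views via the standard whitening-plus-SVD formulation. It is a minimal NumPy implementation for illustration only; the function name and the regularization parameter `reg` are our own choices, not from the paper.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """Classical two-view CCA: find directions Wx, Wy such that the
    projections X @ Wx and Y @ Wy are maximally correlated.
    X: (n, dx), Y: (n, dy); reg adds a small ridge for stability."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view with a Cholesky factor; an SVD of the whitened
    # cross-covariance then yields the maximally correlated directions.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt[:k].T)
    return Wx, Wy, s[:k]  # s[:k] holds the canonical correlations
```

On data sharing a common latent factor, the leading canonical correlation approaches 1, which is exactly the "maximally correlated directions" property the text describes.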
Most existing heterogeneous cross-modal learning methods rely on sufficient multi-modal sample correspondences to learn a mapping across heterogeneous feature spaces, and assume that such correspondences are given in advance [17]. Some recent work, such as Kang et al. [8] and Shen et al. [24], relaxed this assumption by utilizing semi-paired data, i.e., some of the data are paired but the rest are not. In many practical web search scenarios, obtaining an equal number of samples per object in different modalities is not always possible, and the number of training samples per object differs across modalities. For example, suppose the task is to match visual images to text sentences, and during training, 100 images and 50 sentences of each concept are available. Obviously, there is then no sample-to-sample correspondence between the samples of different modalities; rather, classes of samples in one modality correspond to classes of samples in the other modality. For robotic applications, when using different types of sensors, it is common to obtain different numbers of samples from them (see [9]). For material recognition, Liu et al. [12] presented a practical example which combined un-paired images, tactile signals and sound signals. In these cases, sample correspondences between modalities are often totally unknown or difficult to obtain, and the data coming from the multiple modalities may not be paired at all. Therefore, a more practical assumption is that there is no sample-to-sample correspondence between modalities, but there is correspondence between the classes of samples of different modalities. In this work, we refer to this case as weakly-paired multi-modal data, whose definition was given in Lampert and Krömer [9].
Unfortunately, only a few previous works have explored this scenario. Among them, Lampert and Krömer [9] proposed weakly-paired maximum covariance analysis for multi-modal dimensionality reduction. Rasiwasia et al. [22] developed Cluster Canonical Correlation Analysis (C-CCA) for joint dimensionality reduction of two sets of weakly-paired data points; this method requires calculating the correlation between every pair of within-class samples across modalities, leading to poor scalability for more modalities. In addition, Ranjan et al. [21] addressed the multi-label CCA problem, and Mandal et al. [19] developed a semantic-preserving hashing technique that can deal with the multi-label weakly-paired cross-modal retrieval problem. Aytar et al. [1] used only annotations of scene categories to learn a deep aligned cross-modal scene representation, which can deal with more than two modalities. However, learning intermediate modal-invariant representations across different modalities with minimal supervision remains an unsolved problem. Very recently, Wang et al. [28] proposed an adversarial learning method to learn representations that are both discriminative and modal-invariant; this method still requires sample pairing across modalities.
On the other hand, to sufficiently represent heterogeneous features residing in different feature spaces, dictionary learning has attracted ever-increasing attention [3], [15]. Dictionary learning aims to learn a representation space from a set of training examples, in which a given sample can be approximately represented as a sparse or dense code for later processing [30], [31]. As one of the most popular cross-modal learning methods, the coupled dictionary learning method of Zhuang et al. [35] and Deng et al. [6] develops modal-specific dictionaries to map the paired multi-modal data into a common representation space that captures the correlated features. To deal with weakly-paired data, a recent work, Mandal and Biswas [18], incorporated the C-CCA of Rasiwasia et al. [22] into dictionary learning.
A single, shallow level of dictionary learning yields only a coarse latent representation of the data, so some researchers have proposed to learn latent representations via multi-level dictionaries. The idea of learning deeper levels of dictionaries stems from the recent success of deep learning and has attracted attention in recent years. Shen et al. [23] proposed a hierarchical discriminative dictionary learning approach for visual categorization, learning multiple dictionaries at different layers to capture information at varying scales while also encoding information from previous layers. Trigeorgis et al. [27] developed a deep non-negative matrix factorization model for learning rich hidden representations. In Tariyal et al. [25], a greedy learning algorithm for the deep dictionary training framework was developed. The deep dictionary learning architecture has shown excellent performance in silicone mask detection [20], social image understanding [10] and cross-domain recommendation [7]. Since a deep dictionary provides a more flexible structure to capture rich intrinsic information, using it to exploit the intrinsic relations of multi-modal data is very appealing. Very recently, Zhao et al. [34] developed multi-view clustering via paired deep matrix factorization, which does not apply to weakly-paired cross-modal learning.
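To make the idea of multi-level dictionaries concrete, the following is a minimal sketch of a two-level factorization X ≈ D1 D2 C fitted by alternating least squares. It uses dense, ridge-regularized codes as a stand-in for the sparse codes used in the works cited above; all names, dimensions and parameters are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def deep_dict_learn(X, dims, n_iter=60, lam=1e-3, seed=0):
    """Two-level deep dictionary sketch: X ≈ D1 @ D2 @ C, fitted by
    alternating ridge-regularized least squares. X: (d, n) data matrix,
    dims = (k1, k2) atom counts at levels 1 and 2."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    k1, k2 = dims
    D1 = rng.normal(size=(d, k1))
    D2 = rng.normal(size=(k1, k2))
    C = rng.normal(size=(k2, n))
    for _ in range(n_iter):
        D = D1 @ D2                                                # effective dictionary
        C = np.linalg.solve(D.T @ D + lam * np.eye(k2), D.T @ X)   # deepest-level codes
        H1 = D2 @ C                                                # level-1 codes
        D1 = X @ H1.T @ np.linalg.pinv(H1 @ H1.T + lam * np.eye(k1))
        H2 = np.linalg.lstsq(D1, X, rcond=None)[0]                 # targets for level 2
        D2 = H2 @ C.T @ np.linalg.pinv(C @ C.T + lam * np.eye(k2))
    return D1, D2, C
```

When the innermost width k2 matches the intrinsic rank of the data, the product D1 D2 C reconstructs X closely while exposing a compact latent code C, which is the "richer hidden representation" the multi-level formulation is after.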
In this work, our focus is learning cross-modal representations when the modalities are significantly different (e.g., text and natural images) and only minimal supervision (e.g., class labels) is available; that is, no correspondences between samples are given. This problem is very challenging due to the lack of paired samples. Inspired by the deep structures in Trigeorgis et al. [27] and Zhao et al. [34], we develop a new deep dictionary learning method with extra structural constraints to exploit more of the intrinsic structure in the available weakly-paired multi-modal data. The main contributions of this work are summarized as follows.
- 1.
A scalable hierarchical learning architecture is proposed to deal with extensive weakly-paired heterogeneous multi-modal data. It readily extends to cases with more than two modalities.
- 2.
A shared classifier across different modalities in the learned representation space is imposed to utilize the label information. This strategy effectively deals with minimal supervision information for the weakly-paired multi-modal data.
- 3.
A multi-modal low-rank model is introduced to characterize the within-class similarity in the learned representation space, which enables us to impose the low-rank constraint across modalities and obtain the modal-invariant representation.
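The low-rank constraint mentioned in the third contribution is commonly enforced through the nuclear norm, whose proximal operator is singular-value soft-thresholding. The following is our own illustrative sketch of that operator, not the paper's exact update rule.

```python
import numpy as np

def svt(Z, tau):
    """Singular-value soft-thresholding: the proximal operator of
    tau * ||Z||_* (nuclear norm). Applying it to the stacked
    same-class codes of all modalities pushes them toward a shared
    low-rank, hence modal-invariant, structure."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0)   # shrink singular values toward zero
    return (U * s) @ Vt            # rebuild with the thresholded spectrum
```

Singular values below the threshold tau are zeroed out, so the operator strictly reduces (or preserves) the rank of its input.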
The rest of this paper is organized as follows: Section 2 formulates the problem. The optimization algorithm is illustrated in Section 3. Section 4 presents the experimental results.
Notations: We denote the normalized dictionary set as . For convenience, we occasionally omit the superscripts and use when the dimensions can be inferred from the context.
Problem formulation
Suppose we have M modalities, each of which has the same G classes. The number of training samples for the m-th modality is N(m) = Σg=1,…,G Ng(m), where Ng(m) is the number of training samples of the g-th class within the m-th modality. For the m-th modality, we denote the data sample matrix as X(m) = [X1(m), X2(m), …, XG(m)], where Xg(m) ∈ Rdm×Ng(m) represents a set of dm-dimensional training samples from the m-th modality labelled with the g-th class. The label matrix for the m-th modality is denoted as Y(m) ∈ {0, 1}G×N(m), whose columns are the one-hot class indicators of the corresponding samples.
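A small numeric illustration of this weakly-paired setup, with hypothetical per-class counts Ng(m) and feature dimensions dm (all values are made up for illustration; note the two modalities share classes but have different sample counts and dimensions):

```python
import numpy as np

# Toy weakly-paired data: M = 2 modalities, G = 3 shared classes,
# different per-class sample counts N_g^(m) and feature dims d_m.
rng = np.random.default_rng(1)
counts = {0: [4, 6, 5], 1: [7, 3, 8]}   # N_g^(m) for m = 0, 1 (hypothetical)
dims = {0: 10, 1: 20}                    # d_m (hypothetical)
X, Y = {}, {}
for m in (0, 1):
    blocks, labels = [], []
    for g, n_g in enumerate(counts[m]):
        blocks.append(rng.normal(size=(dims[m], n_g)))  # block X_g^(m)
        labels += [g] * n_g
    X[m] = np.hstack(blocks)            # d_m x N^(m) data matrix X^(m)
    Y[m] = np.eye(3)[:, labels]         # G x N^(m) one-hot label matrix Y^(m)
```

Only the class-wise block structure aligns across modalities; there is no column-to-column pairing between X[0] and X[1].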
Deep dictionary learning method
Due to their importance, the representation vectors in C(m) should be developed by exploiting sufficient structure information of the multi-modal data. First, they should faithfully reflect the characteristics of each modality, and therefore exhibit the modal-specific component. Secondly, they should incorporate the modal-invariant shared representation across all modalities. Thirdly, the discriminative capability should be incorporated into the representation vectors.
To achieve the goal, we
Cross-modal retrieval
For a query sample q(m) of the m-th modality, the goal is to search for the most relevant samples in the gallery set G(n) of the n-th modality, where NS is the number of samples in the gallery set.
First, we collapse the multiple levels of dictionaries into a single one per modality, D(m) = D1(m)D2(m)⋯, and then we obtain the representation vector z(m) ∈ RK for the query sample q(m), and the representation vectors for the gallery set G(n), by solving the corresponding coding optimization.
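The retrieval step can be sketched as follows, using a dense ridge-regression coding step as a stand-in for the paper's coding optimization; the function names, the cosine similarity measure and the regularization weight are our own assumptions.

```python
import numpy as np

def encode(D, X, lam=1e-3):
    """Dense ridge coding z = argmin ||x - D z||^2 + lam ||z||^2,
    with D the collapsed (single-level) dictionary of a modality."""
    K = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(K), D.T @ X)

def retrieve(D_q, q, D_g, G, lam=1e-3):
    """Rank gallery samples (columns of G, modality n) for query q
    (modality m) by cosine similarity of their codes in the shared
    K-dimensional representation space."""
    zq = encode(D_q, q.reshape(-1, 1), lam)                        # K x 1
    Z = encode(D_g, G, lam)                                        # K x Ns
    zq = zq / (np.linalg.norm(zq) + 1e-12)
    Z = Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)
    return np.argsort(-(zq.T @ Z).ravel())   # indices, best match first
```

Because both modalities are coded against their own collapsed dictionaries into the same K-dimensional space, the ranking compares heterogeneous samples directly.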
Optimization algorithm
Learning all the dictionaries, the coding vectors and the classifier matrix simultaneously makes problem (5) highly non-convex. Similar to Thiagarajan et al. [26] and Zhao et al. [34], we develop a two-stage learning algorithm consisting of a pre-training stage and a fine-tuning stage.
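A minimal sketch of the pre-training stage, assuming greedy layer-wise factorization in the spirit of Tariyal et al. [25]: each level is fitted by a shallow alternating-least-squares factorization, and its codes become the input of the next level. The fine-tuning stage (not shown) would then refine all levels jointly from this warm start. Function names and parameters are our own illustrative choices.

```python
import numpy as np

def pretrain_layer(X, k, n_iter=30, lam=1e-3, seed=0):
    """Shallow factorization X ≈ D @ H via alternating ridge least
    squares; one greedy pre-training step for a single level."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(X.shape[0], k))
    for _ in range(n_iter):
        H = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)       # codes
        D = X @ H.T @ np.linalg.pinv(H @ H.T + lam * np.eye(k))       # dictionary
    return D, H

def greedy_pretrain(X, layer_sizes):
    """Stage 1: factorize X, then factorize its codes, and so on;
    returns the per-level dictionaries and the deepest codes."""
    dicts, H = [], X
    for k in layer_sizes:
        D, H = pretrain_layer(H, k)
        dicts.append(D)
    return dicts, H
```

The product of the pre-trained dictionaries applied to the deepest codes already reconstructs the data reasonably well, which is what makes it a useful initialization for the non-convex joint fine-tuning.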
Dataset
The proposed method is suitable for cases with more than two modalities. However, since most of the compared methods can handle only two modalities, we limit the evaluations to the following popular two-modality datasets:
- 1.
Wiki dataset consists of 2,173/693 (training/testing) image-text pairs labelled with 10 categories; we use the 500-d BoVW image feature and the 1,000-d BoW text feature.
- 2.
Pascal VOC dataset consists of 2,808/2,841 (training/testing) image-tag pairs labelled by 20
Conclusions
In this work, we establish a deep dictionary learning framework for weakly-paired multi-modal data. The developed algorithm integrates deep structure learning, a low-rank constraint and a shared classifier into a unified framework in order to capture effective information shared by multiple modalities.
Conflict of interest
We confirm that there are no conflicts of interest associated with this submission and there has been no significant financial support for this work that could have influenced its outcome.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant U1613212, Grant 61673238, in part by the Beijing Municipal Science and Technology Commission under Grant D171100005017002, and in part by the National High Technology Research and Development Program of China under Grant 2016YFB0100903.
References (35)
- et al., Simple to complex cross-modal learning to rank, Comput. Vision Image Understanding, 2017.
- et al., Learning multiple diagnosis codes for ICU patients with local disease correlation mining, ACM Trans. Knowl. Discovery Data (TKDD), 2017.
- et al., Cross-modal scene networks, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
- et al., Multimodal task-driven dictionary learning for image classification, IEEE Trans. Image Process., 2016.
- et al., Feature interaction augmented sparse learning for fast kinect motion detection, IEEE Trans. Image Process., 2017.
- et al., Bi-level semantic representation analysis for multimedia event detection, IEEE Trans. Cybern., 2017.
- et al., Semantic pooling for complex event analysis in untrimmed videos, IEEE Trans. Pattern Anal. Mach. Intell., 2017.
- et al., Discriminative dictionary learning with common label alignment for cross-modal retrieval, IEEE Trans. Multimedia, 2016.
- et al., Deep low-rank sparse collective factorization for cross-domain recommendation, Proceedings of the 2017 ACM on Multimedia Conference, 2017.
- et al., Learning consistent feature representation for cross-modal multimedia retrieval, IEEE Trans. Multimedia, 2015.
- Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning, European Conference on Computer Vision.
- Weakly supervised deep matrix factorization for social image understanding, IEEE Trans. Image Process.
- Group-invariant cross-modal subspace learning, IJCAI.
- Multimodal measurements fusion for surface material categorization, IEEE Trans. Instrum. Meas.
- Low-rank multi-view learning in matrix completion for multi-label image classification, AAAI.
- Adaptive unsupervised feature selection with structure regularization, IEEE Trans. Neural Netw. Learn Syst.
- Joint attributes and event analysis for multimedia event detection, IEEE Trans. Neural Netw. Learn Syst.