Pattern Recognition Letters

Volume 130, February 2020, Pages 199-206

Weakly-paired deep dictionary learning for cross-modal retrieval

https://doi.org/10.1016/j.patrec.2018.06.021

Highlights

  • A hierarchical learning architecture is proposed to deal with the weakly-paired heterogeneous multi-modal data.

  • A shared classifier across different modalities in the learned representation space is imposed to utilize the label information.

  • A low-rank model is introduced to characterize the within-class similarity in the learned representation space.

Abstract

Many multi-modal datasets exhibit significant weak-pairing characteristics, i.e., there is no sample-to-sample correspondence between modalities; rather, classes of samples in one modality correspond to classes of samples in the other modality. This poses great challenges for cross-modal learning for retrieval. In this work, our focus is learning cross-modal representations with minimal class-label supervision and without correspondences between samples. To tackle this challenging problem, we establish a scalable hierarchical learning architecture to deal with weakly-paired heterogeneous multi-modal data. A shared classifier across different modalities is used to effectively exploit the label supervision, and a multi-modal low-rank model is introduced to encourage modal-invariant representations. Finally, cross-modal retrieval experiments on publicly available datasets are performed to show the advantages of the proposed method.

Introduction

The cross-modal retrieval task aims to search for relevant samples of one modality given a query from another modality, and has received extensive attention from the Internet and robotics communities, among others [4], [14]. The main challenge of cross-modal retrieval lies in the heterogeneous gap [16]. Since data from different modalities reside in different feature spaces, traditional uni-modal approaches cannot be applied directly. Hence, how to capture and correlate heterogeneous features from different modalities is the key to tackling the cross-modal learning problem [5]. Over the past decades, many methods have been proposed to learn a common representation of multi-modal data, which provides the basis for similarity search across modalities. To name a few, the Canonical Correlation Analysis (CCA) method and its extensions learn two projection directions, one per modality, along which the data are maximally correlated. Wang et al. [29] presented a comprehensive survey on cross-modal retrieval. Recently, it has become widely recognized that incorporating class labels or group structure when learning the low-dimensional subspaces is beneficial for cross-modal retrieval tasks; see [11], [22] and [33] for examples.
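To make the correlation-maximization idea behind CCA concrete (as background only, not the method proposed in this paper), the following minimal Python sketch applies scikit-learn's CCA to two synthetic, sample-paired feature matrices; all sizes and variable names are illustrative assumptions.

```python
# Minimal CCA sketch (background illustration): learn one projection per modality
# so that the projected, sample-paired data are maximally correlated.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs, d_img, d_txt, k = 200, 50, 80, 10        # assumed sizes
X_img = rng.standard_normal((n_pairs, d_img))     # image features (rows = samples)
X_txt = rng.standard_normal((n_pairs, d_txt))     # text features, paired row-by-row

cca = CCA(n_components=k)
cca.fit(X_img, X_txt)                             # requires sample-to-sample pairing
Z_img, Z_txt = cca.transform(X_img, X_txt)        # correlated k-dimensional codes

# Cross-modal similarity can then be measured in the shared space, e.g. by
# cosine similarity between rows of Z_img and Z_txt.
```

Note that plain CCA presupposes sample-level pairing, which is exactly what the weakly-paired setting considered in this paper lacks.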

Most existing heterogeneous cross-modal learning methods rely on sufficient multi-modal sample correspondences to learn a mapping across heterogeneous feature spaces, and assume that such correspondences are given in advance [17]. Some recent work, such as Kang et al. [8] and Shen et al. [24], relaxed this assumption by utilizing semi-paired data, i.e., some of the data are paired while the rest are not. In many practical web search scenarios, obtaining an equal number of samples per object in different modalities is not always possible, and the number of training samples per object differs across modalities. For example, suppose the task is to match visual images to text sentences, and during training 100 images and 50 sentences are available for each concept. Obviously, there is then no sample-to-sample correspondence between the samples of different modalities; rather, classes of samples in one modality correspond to classes of samples in the other modality. For robotic applications, when using different types of sensors, it is common to obtain different numbers of samples from them (see [9]). For material recognition, Liu et al. [12] presented a practical example which combines un-paired images, tactile signals and sound signals. In these cases, sample correspondences between modalities are often completely unknown or difficult to obtain, and the data coming from the multiple modalities may not be paired at all. Therefore, a more practical assumption is that there is no sample-to-sample correspondence between modalities; rather, there is correspondence between the classes of samples of different modalities. In this work, we refer to this case as weakly-paired multi-modal data, the definition of which was given in Lampert and Krömer [9].

Unfortunately, only a few previous works have explored this scenario. Among them, Lampert and Krömer [9] proposed weakly-paired maximum covariance analysis for multi-modal dimensionality reduction. Rasiwasia et al. [22] developed Cluster Canonical Correlation Analysis (C-CCA) for joint dimensionality reduction of two sets of weakly-paired data points; this method requires calculating the correlation between every pair of within-class samples across modalities, leading to poor scalability when more modalities are involved. In addition, Ranjan et al. [21] addressed the multi-label CCA problem, and Mandal et al. [19] developed a semantics-preserving hashing technique which can deal with the multi-label weakly-paired cross-modal retrieval problem. Aytar et al. [1] used only annotations of scene categories to learn a deep aligned cross-modal scene representation, which can deal with more than two modalities. However, learning intermediate modal-invariant representations across different modalities with minimal supervision remains an unsolved problem. Very recently, Wang et al. [28] proposed an adversarial learning method to learn representations that are both discriminative and modal-invariant; this method still requires sample pairing across modalities.

On the other hand, to adequately represent heterogeneous features residing in different feature spaces, dictionary learning has attracted ever-increasing attention [3], [15]. Dictionary learning aims to learn a representation space from a set of training examples, in which a given sample can be approximately represented as a sparse or dense code for later processing [30], [31]. As one of the most popular cross-modal learning approaches, the coupled dictionary learning methods in Zhuang et al. [35] and Deng et al. [6] develop modal-specific dictionaries that map the paired multi-modal data into a common representation space capturing the correlated features. To deal with weakly-paired data, a recent work by Mandal and Biswas [18] incorporated the C-CCA of Rasiwasia et al. [22] into dictionary learning.

A single, shallow level of dictionary learning yields only a coarse latent representation of the data. Some researchers have therefore proposed to learn latent representations through multi-level dictionaries. The idea of learning deeper levels of dictionaries stems from the recent success of deep learning and has attracted attention in recent years. Shen et al. [23] proposed a hierarchical discriminative dictionary learning approach for visual categorization; they learn multiple dictionaries at different layers to capture information at varying scales, where each layer also encodes information from the previous layers. Trigeorgis et al. [27] developed a deep non-negative matrix factorization model for learning rich hidden representations. In Tariyal et al. [25], a greedy learning algorithm for the deep dictionary training framework was developed. The deep dictionary learning architecture has shown excellent performance in silicone mask detection [20], social image understanding [10], and cross-domain recommendation [7]. Since deep dictionaries provide a more flexible structure to capture rich intrinsic information, using them to exploit the intrinsic relations of multi-modal data is very appealing. Very recently, Zhao et al. [34] developed multi-view clustering via paired deep matrix factorization, which does not apply to weakly-paired cross-modal learning.
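To illustrate the general idea of multi-level (deep) dictionary learning discussed above, the following sketch factorizes the data as X ≈ D1 D2 … DL C and trains the layers greedily, in the spirit of Tariyal et al. [25]; it uses dense ridge-regularized codes for simplicity and is not the algorithm proposed in this paper.

```python
# Greedy layer-wise deep dictionary learning sketch (illustrative only):
# X is approximated by D1 D2 ... DL C, one layer trained at a time.
import numpy as np

def dictionary_layer(X, n_atoms, n_iter=50, lam=0.1):
    """Factor X (d x N) into D (d x n_atoms) and codes C (n_atoms x N)
    by alternating ridge-regularized least squares."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((X.shape[0], n_atoms))
    for _ in range(n_iter):
        # code update: C = argmin_C ||X - D C||_F^2 + lam ||C||_F^2
        C = np.linalg.solve(D.T @ D + lam * np.eye(n_atoms), D.T @ X)
        # dictionary update, then project atoms onto the unit l2 ball
        D = X @ C.T @ np.linalg.inv(C @ C.T + lam * np.eye(n_atoms))
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D, C

def deep_dictionary(X, layer_sizes):
    """Train layers greedily: the codes of one layer are the input of the next."""
    dictionaries, H = [], X
    for n_atoms in layer_sizes:
        D, H = dictionary_layer(H, n_atoms)
        dictionaries.append(D)
    return dictionaries, H            # H holds the deepest-layer representation

# Assumed toy data: 500-dimensional features, 300 samples, a 3-layer hierarchy.
X = np.random.default_rng(1).standard_normal((500, 300))
dicts, codes = deep_dictionary(X, layer_sizes=[256, 128, 64])
```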

In this work, our focus is learning cross-modal representations when the modalities are significantly different (e.g., text and natural images) and only minimal supervision (e.g., class labels) is available; that is, no correspondences between samples are given. This problem is very challenging due to the lack of paired samples. Inspired by the deep structures in Trigeorgis et al. [27] and Zhao et al. [34], we develop a new deep dictionary learning method with additional structural constraints to exploit more of the intrinsic structure in the available weakly-paired multi-modal data. The main contributions of this work are summarized as follows.

  • 1.

    A scalable hierarchical learning architecture is proposed to deal with weakly-paired heterogeneous multi-modal data. It can easily be extended to cases with more than two modalities.

  • 2.

    A shared classifier across different modalities in the learned representation space is imposed to utilize the label information. This strategy effectively deals with minimal supervision information for the weakly-paired multi-modal data.

  • 3.

    A multi-modal low-rank model is introduced to characterize the within-class similarity in the learned representation space, which enables us to impose a low-rank constraint across modalities and obtain modal-invariant representations (a generic sketch of such a low-rank step is given after this list).
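As a generic illustration of the third contribution (under assumed shapes, not the paper's exact formulation), within-class codes gathered from different modalities can be stacked and pushed toward a common low-rank structure via singular value thresholding, the proximal operator of the nuclear norm:

```python
# Illustrative low-rank step: encourage the class-g codes gathered from
# different modalities to share a low-rank structure. Shapes are assumed.
import numpy as np

def svt(M, tau):
    """Singular value thresholding: argmin_Z tau*||Z||_* + 0.5*||Z - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

K, n_img, n_txt = 64, 100, 50                      # different sample counts per modality
C_g_img = np.random.randn(K, n_img)                # class-g codes, image modality
C_g_txt = np.random.randn(K, n_txt)                # class-g codes, text modality
Z_g = svt(np.hstack([C_g_img, C_g_txt]), tau=1.0)  # low-rank surrogate for class g
```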

The rest of this paper is organized as follows: Section 2 formulates the problem, Section 3 describes the optimization algorithm, and Section 4 presents the experimental results.

Notations: We denote the set of normalized dictionaries as $\mathcal{D}^{d\times K}=\{D\in\mathbb{R}^{d\times K}\;|\;\|D(:,k)\|_2\le 1,\ k\in\{1,2,\dots,K\}\}$. For convenience, we occasionally omit the superscripts and write $\mathcal{D}$ when the dimensions can be inferred from the context.

Section snippets

Problem formulation

Suppose we have M modalities, each of which has the same G classes. The number of training samples for the m-th modality is $N_m=\sum_{g=1}^{G}N_g^m$, where $N_g^m$ is the number of training samples of the g-th class within the m-th modality. For the m-th modality, we denote the data sample matrix as $X^{(m)}=[X_1^{(m)},X_2^{(m)},\dots,X_G^{(m)}]\in\mathbb{R}^{d_m\times N_m}$, where $X_g^{(m)}\in\mathbb{R}^{d_m\times N_g^m}$ represents a set of $d_m$-dimensional training samples from the m-th modality labelled with the g-th class. The label matrix for the m-th modality is denoted as Y
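The snippet above is truncated, but the weak-pairing setup it describes can be made concrete with the following sketch (dimensions, class counts and names are assumptions for illustration): each modality stores class-wise sample blocks $X_g^{(m)}$ with possibly different sample counts per class, and only the class labels are shared across modalities.

```python
# Assumed illustration of weakly-paired data: each modality m keeps class-wise
# blocks X[m][g] (playing the role of X_g^(m)); counts differ across modalities.
import numpy as np

G = 10                                            # shared classes
d = {"image": 500, "text": 1000}                  # feature dimension per modality
counts = {"image": [100] * G, "text": [50] * G}   # N_g^m, no sample-level pairing

rng = np.random.default_rng(0)
X = {m: [rng.standard_normal((d[m], counts[m][g])) for g in range(G)] for m in d}

N = {m: sum(counts[m]) for m in d}                           # N_m = sum_g N_g^m
labels = {m: np.repeat(np.arange(G), counts[m]) for m in d}  # class labels only
# There is no row/column correspondence between X["image"] and X["text"];
# only the class structure (g = 0..G-1) is shared.
```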

Deep dictionary learning method

Due to their importance, the representation vectors in $C^{(m)}$ should be constructed by exploiting sufficient structural information of the multi-modal data. First, they should faithfully reflect the characteristics of each modality and therefore exhibit a modal-specific component. Secondly, they should incorporate the modal-invariant representation shared across all modalities. Thirdly, discriminative capability should be built into the representation vectors.

To achieve the goal, we

Cross-modal retrieval

For a query sample $q^{(m)}\in\mathbb{R}^{d^{(m)}}$ of the m-th modality, the goal is to search for the most relevant samples in the gallery set of the n-th modality $G^{(n)}\in\mathbb{R}^{d^{(n)}\times N_S}$, where $N_S$ is the number of samples in the gallery set.

First we collapse the multiple levels of dictionaries into single ones as $D^{(m)}=D_1^{(m)}D_2^{(m)}\cdots D_L^{(m)}$ and $D^{(n)}=D_1^{(n)}D_2^{(n)}\cdots D_L^{(n)}$, and then we obtain the representation vector $z^{(m)}\in\mathbb{R}^{K}$ for the query sample $q^{(m)}$, and the representation vectors $H^{(n)}\in\mathbb{R}^{K\times N_S}$ for the gallery set $G^{(n)}$, by solving the optimization
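The snippet above is cut off before the coding objective, so the following sketch only illustrates the overall retrieval pipeline under stated assumptions: the per-layer dictionaries are collapsed by matrix products, the query and gallery are coded with a simple ridge-regression stand-in for the paper's optimization, and gallery items are ranked by cosine similarity in the shared code space.

```python
# Hedged retrieval sketch (the coding objective here is a ridge-regression
# stand-in, not the paper's exact formulation).
import numpy as np
from functools import reduce

def collapse(dictionaries):
    """D = D_1 D_2 ... D_L."""
    return reduce(np.matmul, dictionaries)

def code(D, X, lam=0.1):
    """Dense codes H = argmin_H ||X - D H||_F^2 + lam ||H||_F^2."""
    K = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(K), D.T @ X)

def retrieve(q_m, dicts_m, G_n, dicts_n, lam=0.1):
    D_m, D_n = collapse(dicts_m), collapse(dicts_n)
    z = code(D_m, q_m.reshape(-1, 1), lam)         # z^(m) in R^K
    H = code(D_n, G_n, lam)                        # H^(n) in R^(K x N_S)
    sim = (H.T @ z).ravel() / (np.linalg.norm(H, axis=0) * np.linalg.norm(z) + 1e-12)
    return np.argsort(-sim)                        # gallery indices, most relevant first
```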

Optimization algorithm

Learning all the dictionaries, the coding vectors and the classifier matrix simultaneously makes problem (5) highly non-convex. Similar to Thiagarajan et al. [26] and Zhao et al. [34], we develop a two-stage learning algorithm consisting of a pre-training stage and a fine-tuning stage.
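The update rules of problem (5) are not reproduced in this snippet, so the sketch below only shows an assumed overall structure of such a pre-training/fine-tuning scheme, reusing the deep_dictionary, collapse and code helpers from the earlier sketches; the classifier update is a ridge-regression stand-in, and the dictionary and low-rank refinements are indicated by comments only.

```python
# Assumed two-stage skeleton (not the paper's exact updates).
# X: dict of full sample matrices, X[m] of shape (d_m, N_m)
#    (e.g. np.hstack of the class-wise blocks); labels[m]: class index per column.
import numpy as np

def pretrain(X, layer_sizes):
    """Stage 1: greedy layer-wise pre-training per modality (see earlier sketch)."""
    out = {m: deep_dictionary(X[m], layer_sizes) for m in X}
    return {m: out[m][0] for m in out}, {m: out[m][1] for m in out}

def finetune(X, dicts, codes, labels, G, n_epochs=10, lam=0.1):
    """Stage 2: alternate a shared classifier update and per-modality code refresh."""
    W = None
    for _ in range(n_epochs):
        # (a) shared ridge-regression classifier on codes from all modalities
        C_all = np.hstack([codes[m] for m in X])
        Y_all = np.hstack([np.eye(G)[:, labels[m]] for m in X])
        K = C_all.shape[0]
        W = Y_all @ C_all.T @ np.linalg.inv(C_all @ C_all.T + lam * np.eye(K))
        # (b) refresh per-modality codes against the collapsed dictionaries
        for m in X:
            codes[m] = code(collapse(dicts[m]), X[m], lam)
        # (c) dictionary refinement and the per-class low-rank step would go here
    return dicts, codes, W
```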

Dataset

The proposed method is suitable for cases with more than two modalities. However, since most of the compared methods can handle only two modalities, we limit the evaluations to the following popular two-modality datasets:

  • 1.

    The Wiki dataset consists of 2,173/693 (training/testing) image-text pairs labelled with 10 categories; we use the 500-d bag-of-visual-words (BoVW) image feature and the 1,000-d bag-of-words (BoW) text feature.

  • 2.

    The Pascal VOC dataset consists of 2,808/2,841 (training/testing) image-tag pairs labelled with 20

Conclusions

In this work, we establish a deep dictionary learning framework for weakly-paired multi-modal data. The developed algorithm combines deep structure learning, a low-rank constraint and a shared classifier in a unified framework in order to capture the effective information shared by multiple modalities.

Conflict of interest

We confirm that there are no conflicts of interest associated with this submission and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant U1613212, Grant 61673238, in part by the Beijing Municipal Science and Technology Commission under Grant D171100005017002, and in part by the National High Technology Research and Development Program of China under Grant 2016YFB0100903.

References (35)

  • M. Luo et al.

    Simple to complex cross-modal learning to rank

    Comput. Vision Image Understanding

    (2017)
  • S. Wang et al.

    Learning multiple diagnosis codes for ICU patients with local disease correlation mining

    ACM Trans. Knowl. Discovery Data (TKDD)

    (2017)
  • Y. Aytar et al.

    Cross-modal scene networks

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • S. Bahrampour et al.

    Multimodal task-driven dictionary learning for image classification

    IEEE Trans. Image Process.

    (2016)
  • X. Chang et al.

    Feature interaction augmented sparse learning for fast kinect motion detection

    IEEE Trans. Image Process.

    (2017)
  • X. Chang et al.

    Bi-level semantic representation analysis for multimedia event detection

    IEEE Trans. Cybern.

    (2017)
  • X. Chang et al.

    Semantic pooling for complex event analysis in untrimmed videos

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • C. Deng et al.

    Discriminative dictionary learning with common label alignment for cross-modal retrieval

    IEEE Trans. Multimedia

    (2016)
  • S. Jiang et al.

    Deep low-rank sparse collective factorization for cross-domain recommendation

    Proceedings of the 2017 ACM on Multimedia Conference

    (2017)
  • C. Kang et al.

    Learning consistent feature representation for cross-modal multimedia retrieval

    IEEE Trans. Multimedia

    (2015)
  • C.H. Lampert et al.

    Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning

    European Conference on Computer Vision

    (2010)
  • Z. Li et al.

    Weakly supervised deep matrix factorization for social image understanding

    IEEE Trans. Image Process.

    (2017)
  • J. Liang et al.

    Group-invariant cross-modal subspace learning

    IJCAI

    (2016)
  • H. Liu et al.

    Multimodal measurements fusion for surface material categorization

    IEEE Trans. Instrum. Meas.

    (2018)
  • M. Liu et al.

    Low-rank multi-view learning in matrix completion for multi-label image classification

    AAAI

    (2015)
  • M. Luo et al.

    Adaptive unsupervised feature selection with structure regularization

    IEEE Trans. Neural Netw. Learn Syst.

    (2018)
  • Z. Ma et al.

    Joint attributes and event analysis for multimedia event detection

    IEEE Trans. Neural Netw. Learn Syst.

    (2017)