Published in: Complex & Intelligent Systems 1/2024

Open Access 14-08-2023 | Original Article

Parameter sharing and multi-granularity feature learning for cross-modality person re-identification

Authors: Sixian Chan, Feng Du, Tinglong Tang, Guodao Zhang, Xiaoliang Jiang, Qiu Guan

Published in: Complex & Intelligent Systems | Issue 1/2024


Abstract

Visible-infrared person re-identification aims to match pedestrian images between the visible and infrared modalities, and its two main challenges are intra-modality differences and cross-modality differences between visible and infrared images. To address these issues, many advanced methods attempt to design new network structures to extract modality-sharing features and mitigate modality differences, or to learn part-level features to overcome background interference. However, they ignore parameter sharing of the convolutional layers as a way to obtain more modality-sharing features. At the same time, using only part-level features lacks discriminative pedestrian representations such as body structure and contours. To handle these problems, a parameter sharing and feature learning network is proposed in this paper to mitigate modality differences and further enhance feature discrimination. Firstly, a new two-stream parameter sharing network is proposed that shares convolutional layer parameters to obtain more modality-sharing features. Secondly, a multi-granularity feature learning module is designed to reduce modality differences at both the coarse-grained and fine-grained levels while further enhancing feature discriminability. In addition, a center alignment loss is proposed to learn relationships between identities and to reduce modality differences by clustering features toward their centers. For part-level feature learning, the hetero-center triplet loss is adopted to alleviate the strict constraints of the triplet loss. Finally, extensive experiments validate that our method outperforms state-of-the-art methods on two challenging datasets. On the SYSU-MM01 dataset, Rank-1 and mAP reach \(74.0\%\) and \(70.51\%\) in the all-search mode, an improvement of \(3.4\%\) and \(3.61\%\) over the baseline, respectively.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Person re-identification (Re-ID) can be regarded as an image retrieval task that aims to match images of pedestrians taken by different cameras. Since person Re-ID plays an important role in public safety and video surveillance [1], it has been extensively researched in the field of computer vision. Visible-Visible Re-ID (VV-ReID) is a single-modality person re-identification task in which both the query images and the gallery images are taken by visible cameras. With the continuous development of convolutional neural networks [2], high performance has been achieved in single-modality pedestrian re-identification [3]. However, in practical application scenarios such as intelligent video surveillance, cameras that automatically switch from visible to infrared mode depending on the light level are widely used. The visible camera captures three-channel RGB images, while the infrared camera captures single-channel infrared (IR) images, which belong to two different modalities. How to match RGB images with IR images has attracted a lot of attention in the field of computer vision, leading to the emergence of cross-modality person re-identification (VI-ReID) [4].
There are two main challenges in the VI-ReID task: one is the inter-modality difference between the images captured by visible and infrared cameras, and the other is the intra-modality difference caused by background interference, camera angle, etc. To mitigate modality differences, the main approaches are divided into two types: GAN-based and feature extraction-based methods. The GAN-based methods [5–8] employ generators to convert RGB images into fake IR images, mitigating modality differences in pixel space. However, the GAN-based methods have the following drawbacks [9]: firstly, the network is difficult to converge, and secondly, the generator destroys the structural and textural information of the body, making it difficult to extract discriminative features. The feature extraction-based methods [10, 11] attempt to extract modality-sharing features and project them into a unified feature space, where loss functions [12, 13] are used to minimize modality differences. These methods typically employ ResNet50 [2] as the feature extraction network and project the features into the common space using a fully connected (FC) layer with shared parameters. However, limited by the FC layer, this mechanism only deals with one-dimensional tensors and ignores the three-dimensional spatial structure of the human body. In contrast, the three-dimensional spatial structure of the human body can provide effective modality-sharing features for VI-ReID. In addition, GAN-based and feature extraction-based approaches ignore a promising research direction: hybrid learning between metaheuristics and machine learning. For example, Ramanan et al. [14] propose a cyber-physical-system classification model that detects and classifies different stages of breast cancer from X-ray images, and Sonawane et al. [15] propose a track-and-hunt metaheuristic-based deep neural network to predict faults in voltage source inverters under various load conditions.
The PCB [16] method is used in [9, 17] to extract effective discriminative features and suppress background interference. The PCB method slices the pedestrian features horizontally to obtain part-level features, and then cross-entropy loss [18] and triplet loss [12] are calculated for each slice to enhance feature discrimination and perform part-level feature alignment. However, the PCB method has the following limitations: firstly, it uses horizontal slicing to obtain part-level features, destroying the complete semantic information of the pedestrian; secondly, it only considers part-level feature alignment and lacks exploration of global information (body structure and contour).
To address the above issues, a parameter sharing and feature learning network (PSFLNet) is proposed in this paper to mitigate modality differences and further enhance feature discrimination. Firstly, to overcome the drawbacks of GAN-based methods, a lightweight middle modality generator (MMG [9]) is used to generate middle-modality images that mitigate the modality differences between the visible and infrared modalities. Secondly, considering that convolutional layers are able to handle tensors with a three-dimensional spatial structure, we explore sharing some convolutional layer parameters instead of the FC layer, embedding spatial structure features of the body into a uniform feature space to obtain more modality-sharing features. Based on these considerations, we design a new two-stream parameter-sharing network (TPSN) built on the middle modality generated in the first stage. Furthermore, to overcome the fact that PCB can only acquire local features and lacks exploration of global information (body structure and contour), we design a multi-granularity feature learning (MGFL) module to jointly use global and part-level features for cross-modality alignment. For part-level features, we utilize the hetero-center triplet loss [19] to enlarge the inter-class separation and minimize the intra-class separation, as well as to overcome the influence of bad samples. For global information, we design a center alignment loss to learn the relationships between identities and cluster features toward their centers to reduce modality differences.
The proposed PSFLNet method is validated on two challenging datasets (SYSU-MM01 [4], RegDB [20]) and performs significantly better than other state-of-the-art methods on both RegDB and SYSU-MM01 datasets. It is worth noting that our PSFLNet method achieves Rank-1 accuracy = 74.0% and mAP = 70.51% on the SYSU-MM01 dataset (all-search settings) and Rank-1 accuracy = 79.5% and mAP = 82.1% in indoor-search mode. On the RegDB dataset, under the Visible-to-Infrared settings, Rank-1 accuracy = 95.87% and mAP accuracy = 91.08% are achieved.
The main contributions can be generalized as follows:
  • We build a parameter sharing and feature learning network (PSFLNet) based on the middle modality to alleviate the differences between modalities and further enhance feature discrimination for VI-ReID. The PSFLNet contains a two-stream parameter sharing network (TPSN) and a multi-granularity feature learning module (MGFL).
  • The TPSN is presented to explore sharing convolutional layer parameters to embed spatial structure features of the body into a uniform feature space to obtain more modality-sharing features.
  • The MGFL is proposed to jointly use global and part-level features for cross-modality feature alignment, overcoming the PCB’s lack of exploration of global information. For the global feature, the center alignment loss is designed to probe the relationships between identities and cluster the features toward their centers to reduce modality differences. For part-level features, the hetero-center triplet loss is adopted to alleviate the strict constraints of the triplet loss.
  • Extensive experiments illustrate the proposed method performs much better than other competing approaches on the SYSU-MM01 and RegDB datasets.

Network designing

In VI-ReID, to better extract human features, researchers have designed a variety of deep neural networks. Many works focus on the input of one-stream networks, sharing parameters across the network to exploit information from both visible and infrared images. Wu et al. [21] were the first to propose the cross-modality person Re-ID problem and contributed the SYSU-MM01 dataset; they employed a zero-padding approach to extract modality-shared features. Hou et al. [22] proposed a Dense Feature Pyramid Network (DFPN), which could converge to better performance without pretraining. Wu et al. [23] designed a modality mitigation module and inserted it after the res-layers of a one-stream network to alleviate the modality difference. In addition, Wang et al. [5] adopted a generative adversarial network (GAN) to mitigate modality differences in pixel space: the visible images were first converted into corresponding infrared images and then fed into a one-stream network to extract modality-shared features.
Ye et al. [3] first proposed a two-stream network that used parameter-independent DNNs to extract modality-specific features and then projected these features into a general feature space via a parameter-sharing fully connected (FC) layer. To better learn modality-sharing features, Dai et al. [24] proposed a cross-modality generative adversarial network consisting of a generator that converted visible images to infrared images and a discriminator that distinguished whether the input features came from the visible or infrared modality. As the relationship between visible and infrared images is non-linear [25], the GAN-based approaches could not transform the modalities while ensuring that the person’s identity remained consistent. To overcome the shortcomings of GAN-based methods, Zhang et al. [9] proposed a middle modality generator to generate middle modalities to alleviate modality differences, but they neglected sharing convolutional layer parameters to embed spatial structure features of the body into a uniform feature space and obtain more modality-sharing features. In contrast to the above methods, we design a new two-stream parameter-sharing network (TPSN) based on the middle modality. In TPSN, we construct a parameter-independent feature extraction network and a parameter-sharing feature embedding network by splitting ResNet50 [2]. In the parameter-sharing feature embedding network, parameter sharing is explored in the convolutional layers to preserve the spatial structure information of the human body.

Feature learning

Feature learning has greatly contributed to improved cross-modality person re-identification performance, and researchers typically perform a variety of operations on the features extracted by deep neural networks. Sun et al. [16] proposed the PCB module for extracting discriminative features, slicing the person horizontally to obtain finer-grained features. Taking into account the channel deviations between the visible and infrared modalities, Wu et al. [23] segmented the extracted features along the channels and made each channel focus on a different part of the body to discover subtle differences. Inspired by PCB [16], Feng et al. [26] sliced a person in the horizontal direction, regarded each slice as a node, and used the L2 distance between slices as the metric between nodes for mapping. Ling et al. [17] used cumulative slices to obtain fine-grained features and performed finer modality alignment. These PCB-based methods ignore the fact that PCB destroys the semantic information of the body and lacks the exploitation of global information. To overcome the fact that PCB can only acquire local features and lacks exploration of global information (body structure and contour), we design a multi-granularity feature learning (MGFL) module to jointly use global and local features for cross-modality alignment.

Metric learning

Metric learning methods are widely used for cross-modality person Re-ID. For this task, the metric method intends to learn the similarity between cross-modality image pairs through the network. A reasonable measurement method (or loss function) should make the distance between images of the same identity as small as possible and the distance between images of different identities as large as possible. Ye et al. [10] proposed a bi-directional ranking loss to learn discriminative feature representations in two-stream networks. To reduce the distance between cross-modality image pairs, the cross-entropy loss [27] was incorporated to increase the discriminativeness of identity features. Liu et al. [28] designed a two-modality triplet loss that considered both inter-modality differences and intra-modality variations to guide network learning. Since it was difficult and expensive to directly constrain the distance between modality distributions, Zhu et al. [13] calculated the distance between the centers of the modality distributions and proposed a hetero-center loss. Liu et al. [19] advanced the hetero-center triplet loss to mitigate the strict constraints of the triplet loss: it replaces the comparison of the anchor sample with all other samples by a comparison of the center of the anchor sample with the centers of all other samples, which not only improves Re-ID performance but also reduces the computational cost. However, the above losses ignore the fact that the relationship between identities can provide discriminative information to distinguish them. In our work, we design a center alignment loss to probe the relationships between identities and cluster the features toward their centers for global feature learning, and we adopt the hetero-center triplet loss [19] for part-level feature learning to alleviate the strict constraints of the triplet loss.

Methods

In this section, we introduce the overall structure of the proposed PSFLNet, as shown in Fig. 1. Then, we provide a detailed description of the designs of the middle modality generator (MMG), the two-stream parameter-sharing network (TPSN), and the multi-granularity feature learning (MGFL) module. Finally, a multi-loss scheme is presented to optimize the PSFLNet in an end-to-end manner.

Model construction

Figure 1 illustrates the proposed PSFLNet. The visible (VIS) and infrared (IR) images are fed into the MMG module to generate middle-modality (M-modality) images (\(M_{vis}\) and \(M_{ir}\)), respectively. The produced M-modality images have the same labels as the corresponding VIS and IR images. Then, the M-modality images (\(M_{vis}\), \(M_{ir}\)) are fed together with the VIS and IR images into our proposed TPSN to extract effective features. The TPSN consists of a parameter-independent feature extraction network and a parameter-sharing feature embedding network. The parameter-independent feature extraction network is applied to extract modality-specific features, while the parameter-sharing feature embedding network is designed to map features into a uniform feature space to obtain modality-shared features. Finally, the proposed MGFL processes the features at the coarse-grained and fine-grained levels, respectively. For part-level feature learning, we first cut the features into P slices and then feed each slice into a classifier to explore local cues. To retain more information about the spatial structure, we reduce the stride of the last convolutional block of the network to 1. To accelerate the convergence of the network, we add a batch normalization (BN) layer after the average pooling (AP) layer and then calculate different losses for the features before and after the BN layer, respectively. For global feature learning, we design the center alignment loss (\(\mathcal {L}_{ca}\)) to constrain the average-pooled features.
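To make the two implementation details above concrete, the following PyTorch sketch (our illustration, not the released code) shows how the stride of the last convolutional block can be set to 1 and how a BN layer can be appended after average pooling; freezing the BN bias is an extra assumption borrowed from common BNNeck practice.

```python
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)
# Reduce the down-sampling stride of the last residual block from 2 to 1,
# so the output feature map keeps a larger spatial size.
backbone.layer4[0].conv2.stride = (1, 1)
backbone.layer4[0].downsample[0].stride = (1, 1)

pool = nn.AdaptiveAvgPool2d(1)          # average pooling (AP) layer
bnneck = nn.BatchNorm1d(2048)           # BN layer added after AP
bnneck.bias.requires_grad_(False)       # assumption: freeze BN bias as in BNNeck

x = torch.randn(8, 3, 384, 192)         # a dummy batch of resized images
trunk = nn.Sequential(*list(backbone.children())[:-2])   # conv layers only
feat_map = trunk(x)                     # 3D feature map, (8, 2048, 24, 12)
feat = pool(feat_map).flatten(1)        # features before BN -> metric losses
feat_bn = bnneck(feat)                  # features after BN  -> classification loss
print(feat_map.shape, feat.shape, feat_bn.shape)
```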

Middle modality generator (MMG)

Considering the advantages of the intermediate modality, we adopt the MMG to generate intermediate-modality images to relieve the deviations between visible and infrared images. The input to the MMG is a pair of images of the same identity from the two modalities. Let \(I=\left\{ I_{vis}, I_{ir} \mid I_{vis}, I_{ir} \in \mathbb {R}^{3 \times H \times W}\right\} \) denote a pair of VIS and IR images, where \(3\times H \times W\) corresponds to the channel, height, and width dimensions, respectively. All input images are resized to \(3\times 384 \times 192\), so IR images can be processed as three-channel images. We define \(I_m=\left\{ I_{M_{vis}}, I_{M_{ir}} \mid I_{M_{vis}}, I_{M_{ir}} \in \mathbb {R}^{3 \times H \times W}\right\} \) as the M-modality images, where \(I_{M_{vis}}\) and \(I_{M_{ir}}\) are produced from \(I_{vis}\) and \(I_{ir}\), respectively. The MMG consists of two modality information encoders (\(E_{vis}\) and \(E_{ir}\)) with independent parameters and a modality information decoder D with shared parameters, where D projects the visible and infrared encodings into a unified feature space. For details, refer to MMN [9].
For the VIS images, we have:
$$\begin{aligned} I_{Vtc}={E}_{vis}\left( I_{vis}\right) , \end{aligned}$$
(1)
For the IR images, we have:
$$\begin{aligned} I_{Itc}={E}_{ir}\left( I_{ir}\right) \end{aligned}$$
(2)
Since the modality deviations between VIS and IR images are mainly caused by the channels [25], \(E_{vis}\) and \(E_{ir}\) first encode the three-channel images \(I_{vis}\) and \(I_{ir}\) into single-channel images \(I_c=\left\{ I_{Vtc},I_{Itc} \mid I_{Vtc},I_{Itc}\in \mathbb {R}^{1 \times H \times W}\right\} \) using a 3 \(\times \) 1 fully connected layer, where \(I_{Vtc}\) and \(I_{Itc}\) are the single-channel images encoded by \(E_{vis}\) and \(E_{ir}\), respectively. Then, as the relationship between the VIS images and the IR images is highly non-linear, a ReLU activation layer [29] is used to strengthen the non-linear capability of the network. Finally, a \(1\times 1\) fully connected layer is applied to restructure \(I_{c}\).
We use the parameter-sharing decoder D to decode the single-channel images \(I_{c}\) into middle-modality images \(I_m\). That is:
$$\begin{aligned} I_{M_{vis}}={D}(I_{Vtc}), \quad I_{M_{ir}}={D}(I_{Itc}), \end{aligned}$$
(3)
To obtain three-channel M-modality images \(I_{M_{vis}}\) and \(I_{M_{ir}}\), decoder D is designed to consist of a \(1\times 3\) fully connected layer and a ReLU activation layer. Finally, \(\{I_{vis}, I_{M_{vis}}\}\) and \(\{I_{ir}, I_{M_{ir}}\}\) are sent together into the backbone network to reduce modality discrepancy.
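As a rough illustration of the encoder-decoder structure described above, the sketch below assumes that the channel-wise "fully connected" layers can be implemented as 1×1 convolutions acting across channels at every pixel; the class and variable names are ours, not those of MMN [9].

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encodes a 3-channel image into a single-channel representation."""
    def __init__(self):
        super().__init__()
        self.to_single = nn.Conv2d(3, 1, kernel_size=1)   # 3 -> 1 channel
        self.relu = nn.ReLU(inplace=True)                 # non-linear capability
        self.refine = nn.Conv2d(1, 1, kernel_size=1)      # 1x1 restructuring layer

    def forward(self, x):
        return self.refine(self.relu(self.to_single(x)))

class Decoder(nn.Module):
    """Parameter-shared decoder mapping 1-channel codes back to 3 channels."""
    def __init__(self):
        super().__init__()
        self.to_three = nn.Conv2d(1, 3, kernel_size=1)    # 1 -> 3 channels
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.to_three(x))

E_vis, E_ir, D = Encoder(), Encoder(), Decoder()          # D is shared by both branches
I_vis = torch.randn(4, 3, 384, 192)
I_ir = torch.randn(4, 3, 384, 192)                        # IR replicated to 3 channels
I_M_vis, I_M_ir = D(E_vis(I_vis)), D(E_ir(I_ir))          # middle-modality images
```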

Two-stream parameter sharing network (TPSN)

A two-stream network combines a visible pathway and an infrared pathway. It uses a shallow network with independent parameters to extract modality-specific information from the images of each modality and then uses a fully connected layer with shared parameters to learn the embedding space. Typically, a two-stream backbone is composed of two parts: a feature extraction network and a feature embedding network. The feature extraction network is designed to obtain modality-specific information, while the feature embedding network learns modality-shared features by mapping modality-specific features into a common feature space. Inspired by the literature [30], we design a new two-stream parameter-sharing network based on the middle modality. Its input consists of the middle-modality images \(I_{M_{vis}}\) and \(I_{M_{ir}}\) together with the VIS and IR images. How to design the two-stream parameter-sharing network is the focus of our research, and the two-stream network has the following two problems: firstly, if the feature extraction network contains two independent branches, the number of network parameters is doubled; secondly, if the feature embedding network only consists of fully connected layers, it can only process 1D feature vectors without any human spatial structure information, which is crucial for extracting modality-invariant features.
To solve the above two problems effectively, we divide the ResNet50 model into two parts. The first part is used to construct a parameter-independent feature extraction network, and the latter part is set up as a parameter-sharing feature embedding network containing residual convolution blocks. In this way, not only is the number of network parameters reduced, but the residual convolution layers of the feature embedding network can also process the 3D features output by the feature extraction network, preserving the spatial structure of the human body.
We denote the parameter-independent feature extraction networks of the visible and infrared branches as \(\varphi _{v}\) and \(\varphi _{i}\), respectively, which extract modality-specific features, as shown in Fig. 1. The parameter-sharing feature embedding network is denoted as \(\varphi _{vi}\) and learns modality-sharing features. We first concatenate the images \(I_{vis}\), \(I_{M_{vis}}\), \(I_{ir}\) and \(I_{M_{ir}}\) into one batch and then feed them together into our proposed two-stream parameter sharing network, thus forming a batch of size 4M, where M denotes the number of input images from each modality. The 3D features output by the network can be expressed as \(F\in \left\{ F_{vis}, F_{M_{vis}},F_{ir},F_{M_{ir}}\right\} \), where:
$$\begin{aligned} \left\{ \begin{array}{l} F_{vis}=\varphi _{vi}\left( \varphi _v\left( I_{vis}\right) \right) \\ F_{M_{vis}}=\varphi _{vi}\left( \varphi _v\left( I_{M_{vis}}\right) \right) \\ F_{ir}=\varphi _{vi}\left( \varphi _i\left( I_{ir}\right) \right) \\ F_{M_{ir}}=\varphi _{vi}\left( \varphi _i\left( I_{M_{ir}}\right) \right) \end{array}\right. \end{aligned}$$
(4)
where \(F_{vis}\) and \(F_{ir}\) are the visible and infrared features output by the network, \(F_{M_{vis}}\) and \(F_{M_{ir}}\) are the middle modality features output by the network.
Considering its excellent performance on the Re-ID task and its simple network structure, we use ResNet50 [2] as the backbone network. The network includes a shallow convolutional layer (layer0) and four residual convolutional blocks (layer1, layer2, layer3, and layer4). To explore how to split this network to build our parameter-independent feature extraction network and parameter-sharing feature embedding network, we try four splitting schemes \(S_i\), \(i\in \left\{ 0,1,2,3\right\} \), as shown in Fig. 2. For example, \(S_{1}\) means that layer2, layer3 and layer4 build the parameter-shared feature embedding network, while layer0 and layer1 construct the parameter-independent feature extraction network. In the parameter-sharing feature embedding network, parameter sharing is explored in the convolutional layers to preserve the spatial structure information of the human body.
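The following sketch illustrates the \(S_1\) splitting scheme under the assumption that torchvision's ResNet50 stages map directly onto layer0–layer4 as described; it is an illustrative rewrite rather than the released implementation.

```python
import torch
import torch.nn as nn
import torchvision

def make_stem():
    r = torchvision.models.resnet50(weights=None)
    # layer0 (conv1 + bn1 + relu + maxpool) and layer1, parameter-independent
    return nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)

shared = torchvision.models.resnet50(weights=None)
phi_v, phi_i = make_stem(), make_stem()                               # modality-specific stems
phi_vi = nn.Sequential(shared.layer2, shared.layer3, shared.layer4)   # shared embedding

I_vis = torch.randn(4, 3, 384, 192)
I_ir = torch.randn(4, 3, 384, 192)
F_vis = phi_vi(phi_v(I_vis))     # Eq. (4): visible features
F_ir = phi_vi(phi_i(I_ir))       # infrared features pass through the same embedding network
print(F_vis.shape)               # (4, 2048, 12, 6) with the default strides
```

Because \(\varphi _{vi}\) starts at layer2, its input is still a 3D feature map, which is what preserves the spatial structure of the human body in the shared part of the network.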

Multi-granularity feature learning (MGFL)

In the VV-ReID task, many popular methods employ a slicing operation to obtain finer features and reduce the interference of background information. Inspired by [16], we adopt a uniform horizontal segmentation method to obtain part-level features, and we also probe the effect of the number of slices p on VI-ReID performance. However, the PCB method destroys the semantic information of the body and lacks the exploitation of global information. To overcome the fact that PCB can only acquire local features and lacks exploration of global information (body structure and contour), we design the multi-granularity feature learning (MGFL) module to jointly use global and part-level features for cross-modality alignment.
Given the pedestrian images, the features extracted by our two-stream network are represented as 3D feature maps. As shown in Fig. 1, we perform the following operations on the features:
Part-level feature learning:
(1)
The 3D feature maps are divided evenly into P strips along the horizontal direction to produce part-level feature maps, as shown in Fig. 1.
 
(2)
Given a part-level feature map \(X\in \mathbb {R}^{C \times H \times W}\), we transform the 3D feature map into \(X^{'}\in \mathbb {R}^{C \times 1 \times 1}\) using an average pooling (AP) layer. Then, we reshape \(X^{'}\) into a 1D feature vector.
 
(3)
The hetero-center triplet loss (\(\mathcal {L}_{hc\_tri}\)) is calculated for each 1D feature vector. Then, the 1D feature vectors are normalized and sent to the fully-connected layer to perform the classification task and calculate the classification loss \(\mathcal {L}_{id}\). P feature vectors require P normalization layers and P classifiers, and they are all parameter-independent.
 
Global feature learning:
(1)
The 3D feature maps are not segmented, so the global information is preserved. Given the global feature maps \(Y\in \mathbb {R}^{C \times H \times W}\), we use an average pooling (AP) layer to pool the 3D global features into \(Y^{'}\in \mathbb {R}^{C \times 1 \times 1}\). Then, we reshape \(Y^{'}\) into a 1D global feature vector.
 
(2)
The center alignment loss (\(\mathcal {L}_{ca}\)) is designed to constrain the 1D global feature vectors. The purpose of \(\mathcal {L}_{ca}\) is to learn the relationships between identities and to cluster the features toward their centers.
 
The MGFL module can reduce modality differences at both the coarse-grained and fine-grained levels as well as enhance feature discrimination.
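A minimal sketch of this multi-granularity head is given below; P = 4 parts, 2048 channels and 395 identities are illustrative assumptions, and the loss computations are only indicated in comments.

```python
import torch
import torch.nn as nn

C, P, num_ids = 2048, 4, 395
bns = nn.ModuleList([nn.BatchNorm1d(C) for _ in range(P)])        # one BN per part
classifiers = nn.ModuleList([nn.Linear(C, num_ids) for _ in range(P)])  # one classifier per part

feat_map = torch.randn(8, C, 24, 12)                              # 3D map from the backbone

# Part-level branch: split along the height, average-pool each strip to a vector.
strips = feat_map.chunk(P, dim=2)
part_feats = [s.mean(dim=[2, 3]) for s in strips]                 # P tensors of shape (B, C)
part_logits = [cls(bn(f)) for f, bn, cls in zip(part_feats, bns, classifiers)]
# hetero-center triplet loss is computed on part_feats, identity loss on part_logits

# Global branch: pool the whole map; the center alignment loss constrains this vector.
global_feat = feat_map.mean(dim=[2, 3])                           # (B, C)
```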

Loss optimization

In part-level feature learning, to reduce the gap between the middle-modality images, we calculate the distribution conformance loss (\(\mathcal {L}_{dcl}\)) [9] for the M-modality images. \(\mathcal {L}_{dcl}\) is calculated as:
$$\begin{aligned} \mathcal {L}_{dcl}=\frac{1}{N} \sum _{i=1}^N {\text {mean}}\left[ f\left( I_{M_{vis}}^i\right) -f\left( I_{M_{ir}}^i\right) \right] \end{aligned}$$
(5)
where \(f(\cdot )\) is the output of the network before the fully connected layer and N is the number of \(I_{M_{vis}}\) and \(I_{M_{ir}}\) pairs. \({\text {mean}}[A-B]\) denotes the average of the element-wise difference between A and B. It is evident that optimizing \(\mathcal {L}_{dcl}\) makes the M-modality images more similar.
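A small sketch of \(\mathcal {L}_{dcl}\) is given below. Since Eq. (5) writes a signed mean of the feature difference, we assume it is intended element-wise and take the mean absolute (L1) difference so that positive and negative deviations cannot cancel out.

```python
import torch

def dcl_loss(f_M_vis: torch.Tensor, f_M_ir: torch.Tensor) -> torch.Tensor:
    """f_M_vis, f_M_ir: (N, C) features of paired middle-modality images."""
    # assumption: mean absolute difference, a practical reading of Eq. (5)
    return (f_M_vis - f_M_ir).abs().mean()

# usage: loss_dcl = dcl_loss(features_of_M_vis_images, features_of_M_ir_images)
```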
In addition to \(\mathcal {L}_{dcl}\), we also optimize our PSFLNet using the label-smoothed cross-entropy loss (\(\mathcal {L}_{id}\)) [18] and the hetero-center triplet loss (\(\mathcal {L}_{hc\_tri}\)) [19]. In particular, we design the center alignment loss (\(\mathcal {L}_{ca}\)) to deal with the global features.
The \(\mathcal {L}_{i d}\) loss is widely used in classification tasks and can effectively prevent model overfitting [18]. The \(\mathcal {L}_{i d}\) is formulated as:
$$\begin{aligned} \mathcal {L}_{i d}=\sum _{i=1}^C-q_i \log \left( p_i\right) , q_i=\left\{ \begin{array}{ll} 1-\frac{C-1}{C} \varepsilon &{} \quad i=y \\ \varepsilon / C &{} \quad i \ne y \end{array},\right. \end{aligned}$$
(6)
where C is the number of classes of persons in the training set, \(p_i\) is the prediction of logits for class i, y represents the label for the i-th person, \(\varepsilon \) is a small constant to encourage the model to be less confident on the training set [31].
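As a reference point, the following sketch computes the label-smoothed identity loss of Eq. (6); it is a straightforward reading of the formula with \(\varepsilon =0.1\), not the authors' code.

```python
import torch
import torch.nn.functional as F

def id_loss(logits: torch.Tensor, labels: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """logits: (B, C) class predictions; labels: (B,) integer identity labels."""
    C = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_p, eps / C)                         # eps / C on wrong classes
    smooth.scatter_(1, labels.unsqueeze(1), 1 - (C - 1) / C * eps)   # 1 - (C-1)/C * eps on the true class
    return (-smooth * log_p).sum(dim=1).mean()
```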
Then, we replace the triplet loss with the \(\mathcal {L}_{hc\_tri}\) loss. It alleviates the tight constraint of the triplet loss by replacing the comparison of the anchor sample with all other samples by a comparison of the anchor center with the other sample centers. Since we adopt M-modality images to assist in training PSFLNet, a batch of size 4M is formed, where M represents the number of input images from each modality and \(M=\eta \times K\), with \(\eta \) the number of identities and K the number of images per identity.
In a mini-batch, the central feature of each identity is calculated by the following formula.
$$\begin{aligned} c_V^i&=\frac{1}{K} \sum _{j=1}^K V_j^i, \quad c_I^i=\frac{1}{K} \sum _{j=1}^K I_j^i, \\ c_{M_{vis}}^i&=\frac{1}{K} \sum _{j=1}^K\left( M_{vis}\right) _j^i, \quad c_{M_{ir}}^i=\frac{1}{K}\sum _{j=1}^K\left( M_{ir}\right) _j^i \end{aligned}$$
(7)
where \(V^{i}_{j}\) denotes the feature of the j-th visible image of the i-th person in the mini-batch, and \(I^{i}_{j}\), \(\left( M_{vis}\right) ^{i}_{j}\) and \(\left( M_{ir}\right) ^{i}_{j}\) correspond to the features of the j-th image of the i-th identity in the IR, \(M_{vis}\) and \(M_{ir}\) modalities, respectively. The \(\mathcal {L}_{hc\_tri}\) loss between \(I_{vis}\) and \(I_{ir}\) is formulated as:
$$\begin{aligned} \mathcal {L}_{hc\_tri}(V, I)&= \sum _{i=1}^\eta \left[ \rho +\left\| c_V^i-c_I^i\right\| _2-\min _{\begin{array}{c} n \in \{V, I\}\\ j\ne i \end{array}}\left\| c_V^i-c_n^j\right\| _2\right] _{+} \\&\quad +\sum _{i=1}^\eta \left[ \rho +\left\| c_I^i-c_V^i\right\| _2-\min _{\begin{array}{c} n\in \{V, I\}\\ j\ne i \end{array}}\left\| c_I^i-c_n^j\right\| _2\right] _{+} \end{aligned}$$
(8)
where \(\eta \) is the number of identities, \(V=\left\{ c_V^i \mid i=1, \cdots , \eta \right\} \) is the set of feature centers of the visible images, and \(I=\left\{ c_I^i \mid i=1, \cdots , \eta \right\} \) is the set of feature centers of the infrared images. \(c_{n}^j\) is the feature center of identity j from modality \(n\in \{V, I\}\), \(\rho \) is a margin parameter, \([R]_{+}=\max (R, 0)\), and \(\Vert A-B \Vert _2\) is the Euclidean distance between A and B. For each identity, \(\mathcal {L}_{hc\_tri}\) concentrates on a single cross-modality positive pair and mines the hardest negative pairs within and between modalities.
Computing the \(\mathcal {L}_{hc_{-} tri}\) loss for other modalities is the same as Eq. (8). The final hetero-center triplet loss function of the network is calculated as:
$$\begin{aligned} \mathcal {L}_{hc\_tri}&= \mathcal {L}_{hc\_tri}(V, I)+\mathcal {L}_{hc\_tri}\left( V, M_{ir}\right) \\&\quad +\mathcal {L}_{hc\_tri}\left( I, M_{vis}\right) +\mathcal {L}_{hc\_tri}\left( M_{vis}, M_{ir}\right) \end{aligned}$$
(9)
where \(M_{vis}\) and \(M_{ir }\) represent the feature centers of the middle modal images generated by MMG.
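The sketch below illustrates Eqs. (7)–(9) for one modality pair: identity centers are the per-identity means within the mini-batch, and each anchor center is compared with its cross-modality positive center and the hardest negative center from either modality. It is an illustrative re-implementation of the loss of [19], with margin \(\rho =0.3\) as in the experimental settings.

```python
import torch

def centers(feats: torch.Tensor, n_ids: int, k: int) -> torch.Tensor:
    """feats: (n_ids * k, C) ordered identity-by-identity -> (n_ids, C) centers, Eq. (7)."""
    return feats.view(n_ids, k, -1).mean(dim=1)

def hc_tri_loss(c_v: torch.Tensor, c_i: torch.Tensor, rho: float = 0.3) -> torch.Tensor:
    """c_v, c_i: (eta, C) per-identity centers of the two modalities, Eq. (8)."""
    eta = c_v.size(0)
    all_c = torch.cat([c_v, c_i], dim=0)                  # candidate negative centers
    loss = c_v.new_zeros(())
    for i in range(eta):
        pos = (c_v[i] - c_i[i]).norm()                    # the cross-modality positive pair
        d_v = (c_v[i] - all_c).norm(dim=1)                # anchor c_V^i vs. every center
        d_i = (c_i[i] - all_c).norm(dim=1)                # anchor c_I^i vs. every center
        mask = torch.ones(2 * eta, dtype=torch.bool)
        mask[i], mask[i + eta] = False, False             # exclude same-identity centers
        loss = loss + (rho + pos - d_v[mask].min()).clamp(min=0)
        loss = loss + (rho + pos - d_i[mask].min()).clamp(min=0)
    return loss

# Eq. (9) would then sum hc_tri_loss over the four modality pairs (V,I), (V,M_ir),
# (I,M_vis) and (M_vis,M_ir).
```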
Table 1
Results compared with state-of-the-art methods on SYSU-MM01 datasets

| Model | Pub. | All search: R-1 | R-10 | R-20 | mAP | Indoor search: R-1 | R-10 | R-20 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HCML [10] | AAAI’18 | 14.3 | 53.2 | 69.2 | 16.2 | 24.5 | 73.3 | 86.7 | 30.1 |
| ZERO-Padding [21] | ICCV’17 | 14.8 | 54.1 | 71.3 | 15.9 | 20.6 | 68.4 | 85.8 | 26.9 |
| BDTR [30] | IJCAI’18 | 17.0 | 55.4 | 72.0 | 19.7 | – | – | – | – |
| HSME [39] | AAAI’19 | 20.7 | 62.8 | 78.0 | 23.2 | – | – | – | – |
| cmGAN [24] | IJCAI’18 | 27.0 | 67.5 | 80.6 | 27.8 | 31.7 | 77.2 | 89.2 | 42.4 |
| MAC [36] | MM’19 | 33.3 | 79.0 | 90.1 | 36.2 | 33.4 | 82.5 | 93.7 | 45.0 |
| SNR [37] | CVPR’20 | 34.6 | 75.9 | 86.6 | 33.9 | 40.9 | 83.8 | 91.8 | 50.4 |
| MHM [35] | AAAI’20 | 35.9 | 73.0 | 86.1 | 38.0 | – | – | – | – |
| MSR [38] | TIP’20 | 37.4 | 83.4 | 93.3 | 38.1 | 39.6 | 89.3 | 97.7 | 50.9 |
| expAT [11] | TIP’20 | 38.6 | 76.6 | 86.4 | 38.6 | – | – | – | – |
| SSFT [40] | CVPR’20 | 47.7 | – | – | 54.1 | – | – | – | – |
| DFE [41] | MM’19 | 48.71 | 88.86 | 95.27 | 48.59 | 52.25 | 89.86 | 95.85 | 59.68 |
| DDAA [33] | ECCV’20 | 54.8 | 90.4 | 95.8 | 53.0 | 61.0 | 94.1 | 98.4 | 68.0 |
| NFS [42] | CVPR’21 | 56.9 | 91.3 | 96.5 | 55.5 | 62.8 | 96.5 | 99.1 | 69.8 |
| MPANet [23] | CVPR’21 | 70.58 | 96.12 | 98.8 | 68.24 | 76.74 | 98.21 | 99.57 | 80.95 |
| DCLNet [43] | MM’22 | 70.79 | – | – | 65.18 | 73.51 | – | – | 76.80 |
| MAUM [44] | CVPR’22 | 71.68 | – | – | 68.79 | 76.97 | – | – | 81.94 |
| D2RL [7] | CVPR’19 | 28.9 | 70.6 | 82.4 | 29.2 | – | – | – | – |
| Hi-CMD [45] | CVPR’20 | 34.9 | 77.6 | – | 35.9 | – | – | – | – |
| JSIR-ReID [6] | AAAI’20 | 38.1 | 80.7 | 89.9 | 36.9 | 43.8 | 86.2 | 94.2 | 52.9 |
| AlignGAN [5] | ICCV’19 | 42.4 | 85.0 | 93.7 | 40.7 | 45.9 | 87.6 | 94.4 | 54.3 |
| X-Modality [25] | AAAI’20 | 49.9 | 89.8 | 96.0 | 50.7 | – | – | – | – |
| DG-VAE [46] | MM’20 | 59.5 | 93.8 | – | 58.5 | – | – | – | – |
| FMCNet [47] | CVPR’22 | 66.34 | – | – | 62.51 | 68.15 | – | – | 74.09 |
| MMN [9] | MM’21 | 70.6 | 96.2 | 99.0 | 66.9 | 76.2 | 97.2 | 99.3 | 79.6 |
| Ours | – | 74.0 | 96.5 | 99.0 | 70.51 | 79.5 | 97.5 | 99.24 | 82.1 |

All results obtained with competitive methods are those obtained with single-shot mode; “–” indicates a result not reported.
Furthermore, for global features, the center alignment loss (\(\mathcal {L}_{ca}\)) is proposed to learn the relationships between identities and cluster the features toward their centers. Given the global features (\(V^{'}\), \(I^{'}\)), \(\mathcal {L}_{ca}\) can be calculated as:
$$\begin{aligned} \mathcal {L}_{ca}(V^{'},I^{'})&= \frac{1}{2 M} \sum _{i=1}^{2 M}\left\| \textbf{F}_i-\textbf{0}_{y_i}\right\| _2 \\&\quad +\frac{2}{\eta (\eta -1)} \sum _{k=1}^{\eta -1} \sum _{j=k+1}^\eta \left[ \gamma -\left\| \textbf{0}_{{y}_k}-\textbf{0}_{{y}_j}\right\| _2\right] _{+} \end{aligned}$$
(10)
where \(V^{'}\) and \(I^{'}\) represent the global features of the visible images and the infrared images, respectively; \(\textbf{F}_i\) is the feature of the i-th image; \(\textbf{0}_{y_{i}}\), \(\textbf{0}_{y_{k}}\) and \(\textbf{0}_{y_{j}}\) denote the averages (centers) of the image features labelled as \(y_i\), \(y_k\) and \(y_j\), respectively; \(\eta \) is the number of identities; \(\gamma \) is the least margin among the centers; and 2M is the number of images after the visible and infrared images are concatenated. The \(\mathcal {L}_{ca}\) loss directly establishes relationships between classes rather than between samples, which builds identity learning at the class level. As a result, making the features of different identities discriminative helps to reduce modality differences.
Computing the center alignment loss for other modalities is the same as Eq. (10). The final center alignment loss function of the network is calculated as follows:
$$\begin{aligned} \mathcal {L}_{ca}&= \mathcal {L}_{ca}(V^{'},I^{'})+\mathcal {L}_{ca}(V^{'},M^{'}_{vis}) \\&\quad +\mathcal {L}_{ca}(M^{'}_{vis},I^{'}) +\mathcal {L}_{ca}(M^{'}_{vis},M^{'}_{ir}) \end{aligned}$$
(11)
where \(M^{'}_{vis}\) and \(M^{'}_{ir}\) represent the global features of the middle modality images generated by MMG.
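A compact sketch of Eq. (10) is given below: the features of two modalities are concatenated, class centers are their per-identity means, each feature is pulled toward its own center, and every pair of distinct centers is pushed apart by at least \(\gamma \). The batch is assumed to contain at least two identities; variable names are illustrative.

```python
import torch

def center_alignment_loss(v_feats, i_feats, v_labels, i_labels, gamma: float = 0.7):
    feats = torch.cat([v_feats, i_feats], dim=0)           # F_i, all 2M features
    labels = torch.cat([v_labels, i_labels], dim=0)
    ids = labels.unique()                                  # eta identities (eta >= 2 assumed)
    centers = torch.stack([feats[labels == y].mean(dim=0) for y in ids])
    # first term: pull each feature toward the center of its own identity
    idx = torch.searchsorted(ids, labels)                  # position of each label in sorted ids
    pull = (feats - centers[idx]).norm(dim=1).mean()
    # second term: push every pair of distinct centers at least gamma apart
    eta = ids.numel()
    push = feats.new_zeros(())
    for k in range(eta - 1):
        for j in range(k + 1, eta):
            push = push + (gamma - (centers[k] - centers[j]).norm()).clamp(min=0)
    return pull + 2.0 / (eta * (eta - 1)) * push
```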
The total loss \(\mathcal {L}_{total}\) of the PSFLNet is defined as:
$$\begin{aligned} \mathcal {L}_{total} = \mathcal {L}_{id} + \lambda _1\mathcal {L}_{dcl} +\lambda _2\mathcal {L}_{hc\_tri} +\lambda _3\mathcal {L}_{ca} \end{aligned}$$
(12)
where \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) are weights.
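For completeness, a one-line sketch of Eq. (12) with the weights reported later (\(\lambda _1=0.5\), \(\lambda _2=\lambda _3=1\)); the individual terms are assumed to be computed as in the sketches above.

```python
import torch

def total_loss(l_id: torch.Tensor, l_dcl: torch.Tensor,
               l_hc_tri: torch.Tensor, l_ca: torch.Tensor,
               lam1: float = 0.5, lam2: float = 1.0, lam3: float = 1.0) -> torch.Tensor:
    """Eq. (12): weighted sum of the identity, distribution conformance,
    hetero-center triplet and center alignment losses."""
    return l_id + lam1 * l_dcl + lam2 * l_hc_tri + lam3 * l_ca
```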

Experiments

In this section, we detail the settings of the proposed PSFLNet. Then, comparisons and analyses of PSFLNet with other state-of-the-art methods are introduced. We also perform ablation experiments to prove the efficacy of the modules proposed in PSFLNet. Finally, we illustrate the superior competitiveness of our PSFLNet by visualizing the retrieval results.
Table 2
Results compared with state-of-the-art methods on the RegDB dataset

| Model | Pub. | Visible to infrared: R-1 | R-10 | R-20 | mAP | Infrared to visible: R-1 | R-10 | R-20 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZERO-Padding [21] | ICCV’17 | 17.8 | 34.2 | 44.4 | 18.9 | 16.6 | 34.7 | 44.3 | 17.8 |
| HCML [10] | AAAI’18 | 24.4 | 47.5 | 56.8 | 20.8 | 21.7 | 45.0 | 55.6 | 22.2 |
| cmGAN [24] | IJCAI’18 | – | – | – | – | – | – | – | – |
| MHM [35] | AAAI’20 | 31.1 | 47.0 | 58.6 | 32.1 | – | – | – | – |
| BDTR [30] | IJCAI’18 | 33.6 | 58.6 | 67.4 | 32.8 | 32.9 | 58.5 | 68.4 | 32.0 |
| MAC [36] | MM’19 | 36.4 | 62.4 | 71.6 | 37.0 | – | – | – | – |
| SNR [37] | CVPR’20 | – | – | – | – | – | – | – | – |
| MSR [38] | TIP’20 | 48.4 | 70.3 | 80.0 | 48.7 | – | – | – | – |
| HSME [39] | AAAI’19 | 50.9 | 73.4 | 81.7 | 47.0 | 50.2 | 72.4 | 81.1 | 46.2 |
| SSFT [40] | CVPR’20 | 65.4 | – | – | 65.6 | 63.8 | – | – | 64.2 |
| expAT [11] | TIP’20 | 66.5 | – | – | 67.3 | 67.5 | – | – | 66.5 |
| DDAA [33] | ECCV’20 | 69.3 | 86.2 | 91.5 | 63.5 | 68.1 | 85.2 | 90.3 | 61.8 |
| DFE [41] | MM’19 | 70.2 | – | – | 69.2 | 68.0 | – | – | 66.7 |
| NFS [42] | CVPR’21 | 80.5 | 91.6 | 95.1 | 72.1 | 78.0 | 90.5 | 93.6 | 69.8 |
| DCLNet [43] | MM’22 | 81.2 | – | – | 74.3 | 78.0 | – | – | 70.6 |
| MPANet [23] | CVPR’21 | 83.7 | – | – | 80.9 | 82.8 | – | – | 80.7 |
| MAUM [44] | CVPR’22 | 87.87 | – | – | 85.09 | 86.95 | – | – | 84.34 |
| D2RL [7] | CVPR’19 | 43.4 | 66.1 | 76.3 | 44.1 | – | – | – | – |
| JSIR-ReID [6] | AAAI’20 | 48.1 | – | – | 48.9 | 48.5 | – | – | 49.3 |
| AlignGAN [5] | ICCV’19 | 57.9 | – | – | 53.6 | 56.3 | – | – | 53.4 |
| X-Modality [25] | AAAI’20 | 62.2 | 83.1 | 91.7 | 60.2 | – | – | – | – |
| Hi-CMD [45] | CVPR’20 | 70.9 | 86.4 | – | 66.0 | – | – | – | – |
| DG-VAE [46] | MM’20 | 73.0 | 86.9 | – | 71.8 | – | – | – | – |
| FMCNet [47] | CVPR’22 | 89.12 | – | – | 84.43 | 88.38 | – | – | 83.86 |
| MMN [9] | MM’21 | 91.6 | 97.7 | 98.9 | 84.1 | 87.5 | 96.0 | 98.1 | 80.5 |
| Ours | – | 95.87 | 98.63 | 99.23 | 91.08 | 92.32 | 97.45 | 98.53 | 88.28 |

“–” indicates a result not reported by the corresponding method.

Datasets and experimental settings

\(\mathbf {Dataset.}\) We evaluate our method on two challenging public datasets SYSU-MM01 [4] and RegDB [20].
SYSU-MM01 [4] is a large-scale cross-modality dataset that contains images taken by four visible cameras and two infrared cameras in both indoor and outdoor environments. The training set contains 22,258 visible images and 11,909 infrared images from 395 identities, and the testing set contains 3803 infrared images and 301 randomly selected visible images. In all-search mode, the gallery set contains indoor and outdoor images captured by the four visible cameras. In indoor-search mode, the gallery set contains only visible images captured by the two indoor visible cameras. The all-search mode is more challenging than the indoor-search mode because its scenarios are more complex. We follow the evaluation protocol of [21] to perform ten trials of gallery set selection and then report the average performance.
The RegDB dataset [20] is collected by two aligned cameras (one visible and one infrared) and contains 412 identities, each with 10 visible images and 10 infrared images. We follow the evaluation protocol of [10], in which the dataset is randomly divided into two halves for training and testing, respectively. Specifically, images from one modality are used as the gallery set, while images from the other modality are treated as the query set at the testing stage. The tests are repeated for 10 trials and the average results are reported.
\(\mathbf {Evaluation\quad Protocol.}\) Cumulative matching characteristics (CMC) and mean average precision (mAP) are adopted as evaluation metrics.
\(\mathbf {Experimental\quad details.}\) We implement our model with the PyTorch framework and train it on a single NVIDIA GeForce 3090 GPU. We employ ResNet50, pre-trained on the ImageNet dataset [32], as the backbone network. In each mini-batch, we randomly sample 4 identities and 8 images for each identity, similar to [33]. The model is trained for 80 epochs in total. All input images are first resized to \(3\times 384\times 192\), and data augmentation is performed with horizontal flipping and random erasing [34]. All experiments are optimized with the SGD optimizer and the momentum parameter is set to 0.9. In addition, a warm-up strategy [18] is used to smooth the learning gradient. The initial learning rate is \(1\times 10^{-2}\) and increases linearly to \(1\times 10^{-1}\) over the first 10 epochs; it then decreases to \(1\times 10^{-2}\) at 20 epochs and further to \(1\times 10^{-3}\) at 60 epochs. The margin parameters \(\rho \) and \(\gamma \) in Eqs. (8) and (10) are set to 0.3 and 0.7, respectively. For Eq. (6), following previous work [18], \(\varepsilon \) is set to 0.1. For Eq. (12), \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) are set to 0.5, 1 and 1, respectively.
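The warm-up schedule described above can be sketched as the following epoch-to-learning-rate mapping; the exact linear interpolation over the first 10 epochs is our assumption.

```python
def learning_rate(epoch: int) -> float:
    """Warm-up schedule: 1e-2 rising linearly toward 1e-1 over 10 epochs,
    then 1e-1, dropping to 1e-2 at epoch 20 and 1e-3 at epoch 60."""
    if epoch < 10:
        return 0.01 + (0.1 - 0.01) * epoch / 10   # linear warm-up (assumed interpolation)
    if epoch < 20:
        return 0.1
    if epoch < 60:
        return 0.01
    return 0.001

# usage with SGD: for g in optimizer.param_groups: g["lr"] = learning_rate(epoch)
```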

Comparison with state-of-the-art methods

The results of our proposed method on the SYSU-MM01 and RegDB datasets are shown in Tables 1 and 2. The state-of-the-art methods adopted for comparison fall into two types: feature extraction-based methods (including Zero-Padding [21], HCML [10], cmGAN [24], MHM [35], BDTR [30], MAC [36], SNR [37], MSR [38], HSME [39], SSFT [40], expAT [11], DDAA [33], DFE [41], NFS [42], DCLNet [43], MPANet [23], MAUM [44]) and GAN-based methods (including D2RL [7], JSIA-ReID [6], AlignGAN [5], X-modality [25], Hi-CMD [45], DG-VAE [46], FMCNet [47], MMN [9]).
We can draw the following conclusions from Tables 1 and 2. (1) The proposed PSFLNet method performs significantly better than other state-of-the-art methods on both the RegDB and SYSU-MM01 datasets. It is worth noting that our PSFLNet method achieves a Rank-1 accuracy of 74.0% and an mAP of 70.51% on the SYSU-MM01 dataset (all-search setting), and a Rank-1 accuracy of 79.5% and an mAP of 82.1% in the indoor-search mode. On the RegDB dataset, under the Visible-to-Infrared setting, a Rank-1 accuracy of 95.87% and an mAP of 91.08% are achieved. The results show that our PSFLNet can better learn the modality-shared features between the VIS and IR modalities by using MGFL with the assistance of the M-modality. (2) All GAN-based methods use generative adversarial networks to generate new features or images to mitigate the modality differences between visible and infrared images. GAN-based methods remold information at the channel and spatial levels, destroying the spatial structure and leading to a wide gap between the generated images and the real ones [9]. In contrast, PSFLNet utilizes the MMG to convert visible and infrared images into middle-modality images, effectively preserving the discriminative features of the images.

Ablation study

Table 3
Ablation study of sharing parameters with different layers on the SYSU-MM01 dataset

| Layer1 | Layer2 | Layer3 | Layer4 | R-1 | R-10 | R-20 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | 71.11 | 96.36 | 99.01 | 67.24 |
|  | \(\checkmark \) | \(\checkmark \) | \(\checkmark \) | 74.0 | 96.5 | 99.0 | 70.51 |
|  |  | \(\checkmark \) | \(\checkmark \) | 73.54 | 96.3 | 98.9 | 69.0 |
|  |  |  | \(\checkmark \) | 69.10 | 95.14 | 98.51 | 65.51 |

A check mark indicates that the layer is shared in the parameter-sharing feature embedding network (SYSU-MM01, all-search mode).
Ablation study of sharing parameters with different layers. The key to the design of a two-stream parameter-sharing network is how to split the CNN model to construct a feature extraction network with independent parameters and a feature embedding network with shared parameters. We ablate several parameter-sharing schemes on the SYSU-MM01 dataset (using all-search mode). As can be seen in Table 3, when only layer4 is shared as the embedding network and layer0, layer1, layer2, and layer3 serve as the modality-specific feature extractor, the performance is the worst, since layer4 alone has relatively few parameters and cannot learn effective modality-shared features. The best performance is achieved when layer2, layer3, and layer4 are shared to construct the parameter-sharing feature embedding network, while layer0 and layer1 are used as the parameter-independent feature extraction network. In this case, the parameter-independent feature extraction network can obtain modality-related features, and the parameter-sharing feature embedding network has enough parameters to learn modality-irrelevant features. As the input to the embedding network is a 3D feature map, more information about the spatial structure of the human body is preserved.
Influence of partition strips. In part-level feature learning, the number of partition strips determines the granularity of a person’s local features. Figure 3 shows the results of different numbers of partition strips p on the SYSU-MM01 dataset (using the all-search mode). We observe that \(p=4\) is the best setting of partition strips for extracting local features.
Ablation study of the designed center alignment loss. To further demonstrate that our designed center alignment loss (\(\mathcal {L}_{ca}\)) outperforms the other two triplet losses in constraining global features, we perform an ablation study on the SYSU-MM01 dataset (using single-shot mode in the all-search setting). Except for the loss function used on the global features, the rest of the experimental settings remain the same. Here, \(\mathcal {L}_{hc\_tri}\) [19] denotes the hetero-center triplet loss, and \(\mathcal {L}_{ori\_tri}\) [12] denotes the triplet loss based on hard samples. It can be observed from Table 4 that the best performance is achieved when \(\mathcal {L}_{ca}\) is used for the global features. This is because, for coarse-grained features, the \(\mathcal {L}_{ca}\) loss directly establishes relationships between classes rather than between samples, which builds identity learning at the class level while making features of different identities discriminative.
Table 4
Ablation study of different loss functions for the global feature on SYSU-MM01 dataset

| Method | R-1 | R-10 | R-20 | mAP |
| --- | --- | --- | --- | --- |
| \(\mathcal {L}_{hc\_tri}\) | 71.45 | 95.72 | 98.48 | 67.38 |
| \(\mathcal {L}_{ori\_tri}\) | 71.88 | 95.84 | 98.66 | 67.87 |
| \(\mathcal {L}_{ca}\) | 74.0 | 96.5 | 99.0 | 70.51 |
Table 5
Ablation study of different components on SYSU-MM01 dataset

| Method | R-1 | R-10 | R-20 | mAP |
| --- | --- | --- | --- | --- |
| B | 70.6 | 96.2 | 99.0 | 66.9 |
| B+MGFL | 72.20 | 96.24 | 98.9 | 68.52 |
| B+TPSN | 72.09 | 96.58 | 99.16 | 69.29 |
| B+MGFL+TPSN | 74.0 | 96.5 | 99.0 | 70.51 |
Ablation study of different components. To further reveal the contribution of each component to the devised model, we conduct an ablation study on the SYSU-MM01 dataset by adding components to the baseline network (using single-shot mode in the all-search setting). The results are shown in Table 5, where B denotes the baseline network, which consists of a middle modality generator, a traditional two-stream network and a feature learning module. It can be seen that: (1) our proposed multi-granularity feature learning (MGFL) module improves Rank-1 and mAP by 1.6% and 1.62%, respectively, when using a traditional two-stream network as the backbone, which demonstrates the effectiveness of MGFL; (2) Rank-1 and mAP are improved by 1.49% and 2.39%, respectively, when using the two-stream parameter sharing network (TPSN) rather than a traditional two-stream network, which shows that the TPSN can not only preserve the body structure information but also learn more modality-sharing features; (3) by combining MGFL with TPSN, the modality differences are effectively reduced, which leads to the best results for the proposed PSFLNet method.

Visualisation results

To further demonstrate the advantages of the proposed PSFLNet, we visualize its retrieval results on SYSU-MM01 and compare them with the baseline network (using the multi-shot setting and all-search mode). The obtained Rank-10 ranking lists are presented in Fig. 4. Comparing the retrieval results in the first row, when the query image has poor discrimination, the baseline only finds the correct image at Rank-6, whereas Rank-1 to Rank-4 of our PSFLNet all show correct images. This proves that our proposed PSFLNet can accurately extract modality-independent features and is more robust. Taking an overall look at Fig. 4, we can visually see that the proposed PSFLNet method clearly improves the ranking list, allowing more correctly retrieved images to be ranked at the top.

Conclusion

In this paper, we have introduced a parameter-sharing and feature learning network (PSFLNet) to alleviate the differences between modalities and further enhance feature discrimination for the VI-ReID task. Firstly, a new two-stream parameter-sharing network (TPSN) based on the middle modality has been proposed to explore sharing convolutional layer parameters to obtain more modality-sharing features. Then, a multi-granularity feature learning module (MGFL) has been designed to jointly use global and part-level features for cross-modality feature alignment, overcoming the PCB’s lack of exploration of global information. Besides, a center alignment loss has been presented to learn the relationships between identities and cluster the features toward their centers to reduce modality differences. Numerous ablation studies have proven the effectiveness of each proposed component, and extensive experiments have demonstrated the superior performance of the proposed PSFLNet compared to state-of-the-art methods. Although we designed the MGFL module to overcome the PCB’s lack of exploration of global information, it still does not address the problem of PCB destroying the complete semantic information. In the future, we will investigate the use of image segmentation methods to replace the PCB method in order to reduce background interference: we will utilize image segmentation to obtain the foreground of pedestrian images, reduce the interference of the background, and gain more discriminative features of the human body.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61906168, U20A20171, 62272267, 62102227), Zhejiang Provincial Natural Science Foundation of China (Grant Nos. LY21F020027, LY23F020023), Construction of Hubei Provincial Key Laboratory for Intelligent Visual Monitoring of Hydropower Projects (2022SDSJ01) and Key Programs for Science and Technology Development of Zhejiang Province (2022C03113).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
4. Wu A, Zheng W-S, Yu H-X, Gong S, Lai J (2017) RGB-infrared cross-modality person re-identification. In: Proceedings of the IEEE international conference on computer vision, pp 5380–5389. IEEE
5. Wang G, Zhang T, Cheng J, Liu S, Yang Y, Hou Z (2019) RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3623–3632. https://doi.org/10.1109/ICCV.2019.00372. IEEE
7. Wang Z, Wang Z, Zheng Y, Chuang Y, Satoh S (2019) Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In: IEEE conference on computer vision and pattern recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp 618–626. https://doi.org/10.1109/CVPR.2019.00071. IEEE
8. Chan S, Du F, Lei Y, Lai Z, Mao J, Li C, et al (2022) Learning identity-consistent feature for cross-modality person re-identification via pixel and feature alignment. Mobile Inform Syst 2022. Hindawi
12.
14. Ramanan M, Singh L, Kumar AS, Suresh A, Sampathkumar A, Jain V, Bacanin N (2022) Secure blockchain enabled cyber-physical health systems using ensemble convolution neural network classification. Comput Electr Eng 101:108058
15. Sonawane VR, Patil SB (2023) Track and hunt metaheuristic based deep neural network based fault diagnosis model for the voltage source inverter under varying load conditions. Adv Eng Softw 177:103414
17. Ling Y, Zhong Z, Cao D, Luo Z, Lin Y, Li S, Sebe N (2022) Cross-modality earth mover’s distance for visible thermal person re-identification. CoRR abs/2203.01675. arXiv:2203.01675
18. Luo H, Gu Y, Liao X, Lai S, Jiang W (2019) Bag of tricks and a strong baseline for deep person re-identification. In: IEEE conference on computer vision and pattern recognition workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019, pp 1487–1495. https://doi.org/10.1109/CVPRW.2019.00190. IEEE
21. Wu A, Zheng W-S, Yu H-X, Gong S, Lai J (2017) RGB-infrared cross-modality person re-identification. In: Proceedings of the IEEE international conference on computer vision, pp 5380–5389. IEEE
22. Hou S, Yin K, Liang J, Wang Z, Pan Y, Yin G (2022) Gradient-supervised person re-identification based on dense feature pyramid network. Complex Intell Syst 1–14. Springer
23.
24. Dai P, Ji R, Wang H, Wu Q, Huang Y (2018) Cross-modality person re-identification with generative adversarial training. In: Lang J (ed) Proceedings of the twenty-seventh international joint conference on artificial intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp 677–683. https://doi.org/10.24963/ijcai.2018/94. IJCAI
25. Li D, Wei X, Hong X, Gong Y (2020) Infrared-visible cross-modal person re-identification with an X modality. In: The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, the thirty-second innovative applications of artificial intelligence conference, IAAI 2020, the tenth AAAI symposium on educational advances in artificial intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp 4610–4617. AAAI. https://ojs.aaai.org/index.php/AAAI/article/view/5891
26. Feng Y, Chen F, Yu J, Ji Y, Wu F, Liu S (2021) Homogeneous and heterogeneous relational graph for visible-infrared person re-identification. CoRR abs/2109.08811. arXiv:2109.08811
27. Zhang Z, Sabuncu M (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In: Advances in neural information processing systems 31: annual conference on neural information processing systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 8792–8802
30. Ye M, Wang Z, Lan X, Yuen PC (2018) Visible thermal person re-identification via dual-constrained top-ranking. In: Proceedings of the twenty-seventh international joint conference on artificial intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp 1092–1099. https://doi.org/10.24963/ijcai.2018/152. IJCAI
31. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, pp 2818–2826. https://doi.org/10.1109/CVPR.2016.308. IEEE
32. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp 248–255. IEEE
33. Ye M, Shen J, Crandall DJ, Shao L, Luo J (2020) Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In: European conference on computer vision, pp 229–247. Springer
34. Zhong Z, Zheng L, Kang G, Li S, Yang Y (2020) Random erasing data augmentation. In: Proceedings of the AAAI conference on artificial intelligence, 34, pp 13001–13008. AAAI
40. Lu Y, Wu Y, Liu B, Zhang T, Li B, Chu Q, Yu N (2020) Cross-modality person re-identification with shared-specific feature transfer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13379–13389. IEEE
43. Sun H, Liu J, Zhang Z, Wang C, Qu Y, Xie Y, Ma L (2022) Not all pixels are matched: dense contrastive learning for cross-modality person re-identification. In: Proceedings of the 30th ACM international conference on multimedia, pp 5333–5341. https://doi.org/10.1145/3503161.3547970
44. Liu J, Sun Y, Zhu F, Pei H, Yang Y, Li W (2022) Learning memory-augmented unidirectional metrics for cross-modality person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 19366–19375. IEEE
45.
46. Pu N, Chen W, Liu Y, Bakker EM, Lew MS (2020) Dual Gaussian-based variational subspace disentanglement for visible-infrared person re-identification. In: Proceedings of the 28th ACM international conference on multimedia, pp 2149–2158
47. Zhang Q, Lai C, Liu J, Huang N, Han J (2022) FMCNet: feature-level modality compensation for visible-infrared person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7349–7358. IEEE
Metadata
Title
Parameter sharing and multi-granularity feature learning for cross-modality person re-identification
Authors
Sixian Chan
Feng Du
Tinglong Tang
Guodao Zhang
Xiaoliang Jiang
Qiu Guan
Publication date
14-08-2023
Publisher
Springer International Publishing
Published in
Complex & Intelligent Systems / Issue 1/2024
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI
https://doi.org/10.1007/s40747-023-01189-y
