With the rapid development of deep learning, face forgery detection methods have achieved remarkable progress. However, most methods suffer significant performance degradation on low-quality compressed face images. This is because: (a) image artifacts are blurred during compression, so the model learns insufficient artifact traces; and (b) low-quality images introduce considerable noise, and minimizing the training error drives the model to absorb all correlations in the training set indiscriminately, leading to over-fitting. To address these problems, we learn domain invariant representations that inscribe the correct relevance, i.e., artifacts, to improve robustness on low-quality images. Specifically, we propose a novel face forgery detector, called DIFLD, with two components: (1) a high-frequency invariant feature learning module (hf-IFLM), which effectively recovers the blurred artifacts in low-quality compressed images; and (2) a high-dimensional feature distribution learning module (hd-FDLM), which guides the network to learn distribution-consistent features. With these two modules, the whole framework learns more discriminative and correct artifact features in an end-to-end manner. Extensive experiments show that our proposed method is more robust to image-quality variations, especially on low-quality images, and achieves a 3.67% improvement over state-of-the-art methods on the challenging NeuralTextures dataset.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Introduction
Human faces play an important role in our daily life, for example in access control [1] and face payment [2]. In recent years, however, deep learning techniques [3‐5] have been misused to produce forged faces, leading to the proliferation of faked videos and images on the Internet, represented by “Deepfakes” [6]. By tampering with or replacing the face information of original videos, these deep forgery technologies reduce or even distort the authenticity of the information we see online. This not only affects companies and celebrities but also poses a great threat to the lives and work of ordinary people. Figure 1 shows some faked face images generated by tampering with real images using deep forgery techniques: (A) and (C) show the former US President Barack Obama, created by BuzzFeed in collaboration with Monkeypaw Studios; (B) and (D) are frames from faked videos of ordinary people. These generated images are nearly free of forgery traces, and it is difficult for humans to determine their authenticity with the naked eye. Therefore, it is increasingly urgent to develop effective detection methods.
Toward this concern, many methods [7‐15] have been proposed. Traditional methods focus on designing non-learning algorithms to extract forged features from images and thereby discover differences between real and tampered faces. These features can be divided into color features [10, 11], biological features [12, 13], and other discriminative features. Ultimately, the handcrafted features are fed into machine learning algorithms for classification. Although traditional detection methods are well established, they are limited to small- and medium-scale datasets: in today’s era of exploding data volumes, they lose their efficiency advantage. Moreover, with the development of deep forgery techniques, fake faces are becoming more and more realistic, which presents a great challenge [16] to traditional methods based on specific artifacts.
To capture more subtle differences between real and fake images, later works use deep convolutional neural networks (CNNs) [17] to learn discriminative features from training datasets for face forgery detection. Using an off-the-shelf CNN backbone, these methods take facial images directly as input and classify them as real or fake. These vanilla CNNs, however, tend to look for forgeries in a limited region of faces, indicating that the detectors lack an understanding of forgery [18]. Since then, mainstream research has gradually shifted to improving face forgery detection with specific prior knowledge, by optimizing network structures [14, 19‐21] or designing corresponding end-to-end learning frameworks [15, 22, 23]. Though recent works have achieved sound results, low-quality datasets can easily cause existing methods to fail. That is because: (a) they rely on forgery patterns possessed by the particular manipulation techniques present in the training datasets, but low-quality images blur out these artifacts; and (b) image compression introduces a great deal of extraneous noise, which current methods do not account for. Thus, in the real world, available methods usually fall short of the desired results, because forged face images shared on social media are usually compressed.
As shown in Table 1, extensive comparative experiments reveal that low-quality (LQ) face forgery detection is prone to over-fitting. Inspired by the literature [24], we attribute this problem to the fact that minimizing the training error leads the model to learn all correlations in the training data, regardless of the consequences. LQ images blur the boundaries between true and false artifacts, making the model more susceptible to selection bias and confounding factors. As a consequence, the model not only fails to learn sufficient correct causality but also learns spurious correlations, i.e., non-artifacts. To address these issues, we propose using domain invariant features to inscribe the correct causality, mitigate the model’s over-reliance on data bias, alleviate over-fitting, and improve fake face detection on LQ images. Specifically, for the intermediate feature layers we design high-frequency invariant feature learning modules, and for the high-dimensional feature layer we design a high-dimensional feature distribution invariance loss. Together, the two modules enable the model to learn the correct artifact traces in an end-to-end manner.
Table 1
Comparative experimental results of training and testing on the NeuralTextures RAW (high-quality) and C40 (low-quality) datasets from FaceForensics++ [25], reported as accuracy (Acc)

NeuralTextures    C40       RAW
Train dataset     98.50%    99.50%
Test dataset      79.08%    97.12%
With the above considerations in mind, in this paper, we present a novel face forgery detector framework. Our contributions in this paper are summarized as follows:
This paper is the first to use domain invariance to inscribe the correct artifacts and alleviate the over-fitting problem on low-quality compressed images.
The hf-IFLM and hd-FDLM modules are proposed to learn domain invariant feature representation. They correspond to high-frequency invariant feature learning in the middle layer and high-dimensional feature distribution learning in the output layer, respectively.
Compared to commonly used frequency-domain analysis, we use the Haar wavelet transform to exploit both spatial and frequency information, and we learn the common invariant feature representation more fully by designing a multi-scale, multi-angle learning module.
The remainder of this paper is organized as follows: Section Related work reviews some preliminaries of the proposed algorithm. Section Methods shows the architecture of the DIFLD and proposed algorithm in detail. Section Experimental presents the comparison and ablation experimental results and analysis. Finally, this paper is summarized in Sect. Conclusion.
Related work
Face forgery detection. Nowadays, face forgery detection has drawn significant attention, as it is related to the protection of personal privacy. So far, many face forgery detection methods have been proposed by academia and industry, contributing greatly to research in this direction. For example, [13] proposed to judge real and fake images by the shape of the pupils: a person’s pupils are generally elliptical, while the authors find that faces generated using techniques such as GANs [3] have irregular pupil shapes. In [26], blink frequency was used for fake face detection. However, these methods [10‐13, 27, 28] require extracting handcrafted features, which is not only time-consuming and labor-intensive but also very challenging. Later, with the development of deep learning, forged face detection research began to shift to deep learning methods [14, 29], aiming to achieve more flexible and reliable detection through dynamic feature learning.
For example, Shreyan Ganguly [14] proposes a Vision Transformer with Xception Network (ViXNet), which learns the consistency of the almost imperceptible artifacts left by face forgery methods across the entire facial region. Nevertheless, such methods are insufficient to cope with low-quality images and have limited practical value. To counter this problem, various kinds of additional information are used to enhance performance, most of them based on the spatial domain, such as RGB and HSV. For example, some approaches [12, 13] exploit specific artifacts arising from the synthesis process, such as color or shape cues. Wang [29] proposes a multimodal contrastive classification local correlation representation (MC-LCR) framework for effective face forgery detection. Instead of specific appearance features, MC-LCR amplifies the implicit local differences between real and forged faces in both the spatial and frequency domains. Based on the complementary nature of amplitude and phase information, they develop a patch-wise amplitude and phase dual attention module to capture locally relevant inconsistencies in the frequency domain. But the effectiveness of these methods is limited to the datasets on which they are specially trained. To be more resilient, we turn to learning domain invariant features to inscribe causality, limiting the network’s learning of spurious artifacts so as to improve detection accuracy on low-quality images.
Domain generalization. Traditional machine learning (ML) models are trained under the i.i.d. assumption that training and testing sets are identically and independently distributed. However, this assumption does not always hold in reality. When the probability distributions of training and testing data differ, the performance of ML models often deteriorates due to domain distribution gaps [30]. Collecting data from all possible domains to train ML models is expensive and often simply impossible. Therefore, enhancing the generalization ability of ML models is important in both industry and academia. The goal of domain generalization is to learn a model from one or several different but related domains (i.e., diverse training datasets) that generalizes well to unseen testing domains, that is, a model with minimum prediction error on images from the testing domains.
Most domain generalization methods belong to domain feature alignment schemes, whose core idea is to learn domain invariant representations so as to minimize the differences between source domains. In this work, due to the difference in resolution, we treat the Raw dataset and the C40 dataset as two different domains, which serve as our training datasets, as shown in Fig. 2. We learn domain invariant features via invariant risk minimization [24]. The goal of representation learning can be formulated as Eq. 1:

\(\min _{f,g} \; {\mathbb {E}}\left [\ell \left (f(g(x)), y\right )\right ] + \lambda \, \ell _{reg}(g)\)

where g is a representation learning function, f is the classifier function, \(\ell _{reg}\) denotes a regularization term, \(\lambda \) is the trade-off parameter, \({\mathbb {E}}\) represents expectation, and \(\ell (\cdot )\) is the classification loss function.
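As an illustrative sketch (not the authors' implementation), the Eq. 1 objective for one batch could be evaluated as follows, with cross-entropy standing in for \(\ell \) and `lam` a hypothetical trade-off value:

```python
import numpy as np

def representation_objective(logits, labels, reg_value, lam=0.1):
    """Toy evaluation of the Eq. 1 objective: expected classification
    loss of f(g(x)) plus a weighted regularization term.
    `lam` (the trade-off parameter) is an illustrative value."""
    # cross-entropy as the classification loss \ell
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels]).mean()
    return ce + lam * reg_value
```

Here `logits` would come from the classifier f applied to the learned representation g(x), and `reg_value` from whichever regularizer \(\ell _{reg}\) is chosen.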
Wavelet transform. The wavelet transform (WT) [31, 32] is a transform analysis method that contrasts with the Fourier transform [33, 34], in which a signal is expressed as a sum of sines and cosines. The main difference is that wavelets are localized in both the time and frequency domains, while the standard Fourier transform is localized only in the frequency domain. The major strengths of the WT are its ability to highlight particular aspects of a problem through transformation, to localize analysis in time and spatial frequency, and to refine signals at multiple scales through dilation and translation operations.
Overall, the WT is a multi-scale, multi-resolution decomposition of an image that can focus on arbitrary details of the image, and it is therefore known as a mathematical microscope. Based on this microscopic property, this paper uses the WT to focus on the forgery details in an image.
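To make the decomposition concrete, a minimal single-level 2-D Haar transform can be sketched in NumPy (the subband naming below follows one common convention; signs and normalization vary across libraries):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar transform of an (H, W) array with even sides.
    Returns (LL, LH, HL, HH): the low-frequency approximation plus the
    horizontal, vertical, and diagonal high-frequency subbands."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0   # horizontal detail
    hl = (a - b + c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh
```

With this normalization the transform is orthonormal, so the subbands preserve the energy of the input; applying it again to `ll` gives the next, coarser scale, which is exactly the multi-scale refinement described above.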
Methods
In this work, we propose a deep-learning-based classification model, called DIFLD, which discriminates manipulated facial images generated by different deep face forgery techniques from real ones. The working principle of the proposed model is summarized pictorially in Fig. 4. It comprises two main modules: (1) hf-IFLM, described in Sect. hf-IFLM, and (2) hd-FDLM, described in Sect. hd-FDLM.
Overview
As shown in Fig. 4, in order to improve the detection accuracy of low-quality forged images, we propose two cooperative modules that learn domain-invariant features from two different perspectives: hf-IFLM and hd-FDLM. The Raw and C40 datasets mentioned in this paper are, respectively, the original uncompressed version and the highly compressed version of the FaceForensics++ dataset. From the perspective of learning more artifact traces, we design the hf-IFLM module. By forwarding a C40 compressed input image and its corresponding Raw image through the backbone network, we extract the truncated feature maps \({\widetilde{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\) and \({\widehat{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\) from Stage i of ResNet-50 [35]. For convenience, we use the notation \({\varvec{X_i}}\) to denote the pair \({\widetilde{X}}_i(c,w,h)\) and \({\widehat{X}}_i(c,w,h)\). \({\varvec{X_i}}\) is then passed to the Subtra module, in which multi-angle high-frequency information is used to compensate for the features \(Loss\_i\) discarded during compression. From the perspective of learning the correct artifact traces and eliminating non-forgery traces, we design the hd-FDLM module: we obtain two distributions \(P_{raw}\) and \(P_{c40}\) by computing the distributions of the backbone output-layer feature maps \({\widetilde{F}}_{hd}\) and \({\widehat{F}}_{hd}\), i.e., \({\varvec{F\_hd}}\). Then, the distribution distance \({\mathcal {L}}_{hd}\) is calculated to constrain the similarity between Raw and C40. In this way, the network learns more distribution-consistency information, which is used to portray the correct artifact traces and to remove noise introduced during training.
The representations (\(Loss\_1\), \( Loss\_2\), \(Loss\_3\), \( Loss\_4\), \({\varvec{F\_{hd}}}\)) extracted from the different modules are learned in an end-to-end manner by designing loss functions, where the training process is guided by joint losses \({\mathcal {L}}_{cls}\), \({\mathcal {L}}_{hf}\) and \({\mathcal {L}}_{hd}\). The overall learning of DIFLD is given in the corresponding Algorithm 1.
hf-IFLM
Motivation. In the literature [11, 13], face forgery detection methods usually use spatial domain features. However, frequency domain features are equally important. In [36], the authors showed that a spectrum-based classifier performs better than a pixel-based classifier in detecting forged images: they designed a GAN to synthesize artifacts and added up-sampling components to the GAN to detect these artifacts. Applications of frequency domain features can also be found in other authentication scenarios, such as smartphone user verification. In [37], the authors design a dual-stream network that uses frequency domain features to distinguish legitimate users from impostors on smartphones. Similarly, in [38] the authors design a frequency-based approach to verify smartphone security. Inspired by these successful uses of frequency domain features, we propose to use frequency information to detect faked faces.
As shown in Fig. 3, the synthetic face region (in the red box) and the background region present obviously different distributions in the high-frequency maps, indicating that high-frequency information can capture forgery traces very well [23, 39]. Moreover, as shown in the last row of Fig. 3, the artifacts of generated facial images become blurred when the images are compressed, which prevents the network from learning them. In the top row, by contrast, the traces of forgery are clearly visible. Thus, to compensate for the deficiency of low-quality (C40) images, we attempt to exploit artifacts with high-frequency information from high-quality (Raw) images.
It is well known that the shallow layers of DNNs tend to capture low-level features (e.g., color and texture), while the deeper layers tend to capture high-level abstract features [29]. Therefore, in order to extract more sufficient forgery traces, we perform high-frequency artifact extraction on Stages 1–4 of our backbone network. Besides, as shown in Fig. 3, the high-frequency information in the horizontal, vertical, and diagonal directions can all capture artifacts well, so we use all three directions for feature extraction simultaneously. In summary, we design a multi-scale and multi-angle feature extraction module for high-frequency information, as shown in the top stream of Fig. 4.
In order to convert spatial domain information into frequency domain information, we transform the outputs of Stages 1–4 of ResNet-50. Compared with the conventional discrete Fourier transform (DFT) [32], our work uses the Haar wavelet transform (WT) [33], because the WT replaces the infinite-length trigonometric basis with a finite-length, decaying wavelet basis. This allows the network to capture not only frequency information but also spatial information.
Design of hf-IFLM. First, by forwarding a C40 compressed input image and its corresponding Raw image through the backbone network, we obtain \({\widetilde{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\), the Stage-i feature map of the Raw stream, and \({\widehat{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\), the corresponding feature map of the C40 stream; each has C channels, width W, and height H. These backbone outputs are decomposed with the wavelet transform in Eq. 2. Figure 5 visualizes the Haar wavelet decomposition of a two-dimensional image.

\({\mathcal {H}}_{{\bar{\chi }}_i}(c,a,b) = \Im \left ({\bar{\chi }}_i(c,w,h)\right )\)

where c, w, and h denote the \(c_{th}\), \(w_{th}\), and \(h_{th}\) slice in the channel, width, and height dimensions of \({\widetilde{X}}_i(c,w,h) \) and \({\widehat{X}}_i(c,w,h)\), respectively. \(\Im (\cdot )\) denotes the wavelet transform. \({\mathcal {H}}_{{\bar{\chi }}_i}(c,a,b)\) is composed of the horizontal high-frequency information \({{\mathcal {H}}}_{{\bar{\chi }}_i}(c,a,b)_H\), the vertical high-frequency information \({{\mathcal {H}}}_{{\bar{\chi }}_i}(c,a,b)_V\), and the diagonal high-frequency information \({{\mathcal {H}}}_{{\bar{\chi }}_i}(c,a,b)_D\). \({\bar{\chi }}_i\) stands for the corresponding \({\widetilde{X}}_i\) or \({\widehat{X}}_i\), and a and b are equal to half of w and h, respectively.
After obtaining the high-frequency information of the two datasets with different resolutions, as shown in Fig. 6, we take the vertical direction of Stage i (\(i=1,2,3,4\)) as an example. We assume that the high-frequency difference can offset the artifact loss that compression causes in the C40 data. It is computed as follows:

\({\mathcal {D}}^H_i(c,a,b) = {{\mathcal {H}}}_{{\widetilde{X}}_i}(c,a,b)_V - {{\mathcal {H}}}_{{\widehat{X}}_i}(c,a,b)_V\)
In order to make the difference matrix \({\mathcal {D}}^H_i(c,a,b)\) work better, we design an attention weight \(\omega (a,b)\) based on how humans recognize objects with the naked eye: the surrounding pixels influence the recognition of the current pixel. Therefore, as calculated in Eq. 4, we use the cross-channel difference indices of the four neighboring pixels as weights.
Here \(\gamma _{hf}\) is the mean value of the four adjacent pixels. This design of attention weights ensures that the model pays more attention to the lost high-frequency information.
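As a minimal sketch, under the assumption that the difference matrix is a plain subtraction of matching subbands (before the attention weighting, which is omitted here), the vertical-direction computation for one stage could look like:

```python
import numpy as np

def vertical_highfreq(x):
    """Vertical Haar detail subband of a (C, W, H) feature map,
    computed per channel; output shape is (C, W//2, H//2)."""
    a = x[:, 0::2, 0::2]
    b = x[:, 0::2, 1::2]
    c = x[:, 1::2, 0::2]
    d = x[:, 1::2, 1::2]
    return (a - b + c - d) / 2.0

def highfreq_difference(x_raw, x_c40):
    """Difference matrix between the vertical subbands of the Raw and
    C40 feature maps taken from the same backbone stage (a sketch of
    Eq. 3, without the attention weights of Eq. 4)."""
    return vertical_highfreq(x_raw) - vertical_highfreq(x_c40)
```

The horizontal and diagonal directions would follow the same pattern with their respective sign combinations.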
Finally, we realize multi-scale and multi-angle invariant feature learning by accumulating the losses of Stages 1–4 over the three directions, as in Eq. 5.
Here, \({\varvec{\delta }}\) and \({\varvec{\eta }}\) are vector weight parameters over the different angles and scales, respectively.
The whole module is trained in an end-to-end way. By minimizing \( {\mathcal {L}}_{hf}\) over the Raw and C40 datasets, the network learns more high-frequency domain invariant features to cope with low-quality forged face detection.
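The multi-scale, multi-angle accumulation of Eq. 5 reduces to a weighted double sum over stages and directions; a sketch with illustrative (not the paper's) default weights:

```python
def hf_invariance_loss(stage_losses, delta=(1.0, 1.0, 1.0), eta=(1.0, 1.0, 1.0, 1.0)):
    """stage_losses[i][k]: per-direction loss of Stage i+1
    (k = 0/1/2 for horizontal/vertical/diagonal). `delta` weights the
    directions and `eta` the stages; both defaults are illustrative."""
    total = 0.0
    for i, per_dir in enumerate(stage_losses):
        total += eta[i] * sum(d * l for d, l in zip(delta, per_dir))
    return total
```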
hd-FDLM
Motivation. In [40], the authors propose a novel distribution distillation loss to narrow the performance gap between easy and hard samples. The main idea is to construct two similarity distributions, a teacher distribution from easy samples and a student distribution from hard samples, and to design a distribution-driven loss that constrains the student distribution to approximate the teacher distribution, which leads to a smaller overlap between the positive and negative pairs in the student distribution.
In this paper, we utilize a similar idea for fake face detection of low-quality images. As shown in Fig. 7, we find a significant difference in the distribution of Raw and C40 images after the backbone. Therefore, we intend to use two different feature distribution information to learn domain invariant features, thus reducing the overlap of real and false artifacts in the low-quality distribution, enabling the model to learn the correct correlation.
In order to construct the probability distributions of the Raw and C40 datasets, we use t-SNE [41] to represent the distribution of features between Raw and C40, because t-SNE not only converts the data into a low-dimensional distribution that is easier to manipulate but also preserves the local characteristics of the data: points that were close together remain close after conversion, and points that were far apart remain far apart.
Design of hd-FDLM. Since high-dimensional features [9] capture artifact traces with more subtlety, they facilitate our detection. As shown in Fig. 8, by forwarding a C40 compressed input image and its corresponding Raw image through the backbone network, we obtain the features \({\widetilde{F}}_{hd}\) and \({\widehat{F}}_{hd}\) from the high-dimensional layer, correspondingly. Then, after passing \({\widetilde{F}}_{hd}\) and \({\widehat{F}}_{hd}\) through global average pooling and fully connected layers, the Sigmoid function [42] is computed for each channel, yielding one Sigmoid value per convolved channel.
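A simplified sketch of this per-channel pooling step, omitting the fully connected layer for brevity and using NumPy in place of the actual PyTorch layers:

```python
import numpy as np

def channel_sigmoid(feat):
    """Global average pooling over the spatial dims of a (C, W, H)
    feature map, followed by an element-wise sigmoid: one value per
    channel. (The intermediate fully connected layer is omitted.)"""
    pooled = feat.mean(axis=(1, 2))        # (C,)
    return 1.0 / (1.0 + np.exp(-pooled))   # sigmoid per channel
```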
The t-SNE similarity distribution is computed as Eq. 6:
Here, \({\mathcal {S}}\left( x_{i}, x_{j}\right) \) is the similarity between data points i and j: the closer the distance, the more similar they are. In this paper, we use the Euclidean distance [43] to calculate the similarity between feature maps, as shown in Eq. 7. Given n data points, we define n probability distributions, one for each.
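A sketch of how such row-wise similarity distributions might be built from Euclidean distances (a softmax over negative distances, in the spirit of t-SNE affinities; the exact kernel used in the paper may differ):

```python
import numpy as np

def similarity_distributions(feats):
    """For a batch of n feature vectors, return an (n, n) matrix whose
    i-th row is a probability distribution over the other points:
    similarity decays with Euclidean distance."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-similarity
    sim = np.exp(-d)
    return sim / sim.sum(axis=1, keepdims=True)
```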
After obtaining the distributions of the two data streams, we denote the distribution of the Raw dataset as \(P_{raw}\) and that of the C40 dataset as \(P_{c40}\). To narrow the performance gap between the C40 and Raw datasets, we constrain the similarity distribution of C40 samples to approximate that of Raw samples. Motivated by previous uses of the Kullback–Leibler (KL) divergence [40, 44], we adopt the KL divergence to constrain the similarity between the Raw and C40 distributions, which is defined as follows:

\({\mathcal {L}}_{hd} = \sum _{i=1}^{n} P_{raw}(i)\, \log \frac{P_{raw}(i)}{P_{c40}(i)}\)
where n denotes the length of one batch. Ultimately, the whole framework exploits the common features in the distribution information of the different data streams. Our method incorporates a novel distribution loss that lets the network learn more of the correct artifacts.
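The KL constraint between the two batch-level distributions can be sketched as follows (the `eps` clipping is our addition for numerical safety, not part of the paper):

```python
import numpy as np

def kl_loss(p_raw, p_c40, eps=1e-12):
    """KL(P_raw || P_c40) summed over a batch of row distributions,
    pulling the C40 similarity distribution toward the Raw one."""
    p = np.clip(p_raw, eps, 1.0)
    q = np.clip(p_c40, eps, 1.0)
    return np.sum(p * np.log(p / q))
```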
The entire network is trained in an end-to-end manner by jointly minimizing the classification loss \({\mathcal {L}}_{cls}\), the high-frequency loss \({\mathcal {L}}_{hf}\), and the distribution-learning loss \({\mathcal {L}}_{hd}\). The two modules synergistically complement the domain invariant features by minimizing the training error of the Raw and C40 data streams, allowing the network to learn more, and more correct, forgery traces and thereby improve forged face detection.
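The joint objective then reduces to a weighted sum; the default weights below mirror those reported in the experiments section (classification 1, hf-IFLM 16, hd-FDLM 1), but are otherwise illustrative:

```python
def total_loss(l_cls, l_hf, l_hd, w_cls=1.0, w_hf=16.0, w_hd=1.0):
    """Joint training objective: weighted sum of the classification,
    high-frequency, and distribution-learning losses."""
    return w_cls * l_cls + w_hf * l_hf + w_hd * l_hd
```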
Experimental
In this section, the performance of our method is evaluated. In Section Experiment settings, we first introduce the experimental datasets and describe how the training and testing sets are prepared from them, and then discuss the experimental setup. In Section Detection performance, we present extensive experimental results compared with state-of-the-art (SOTA) methods to demonstrate the superiority of the proposed method. Finally, in Section Ablation study and discussions, the effectiveness of the hf-IFLM and hd-FDLM modules is verified by ablation experiments.
Experiment settings
Datasets. Following previous face forgery detection methods [11, 14, 15, 19‐23], we conduct our experiments on the challenging FaceForensics++ (FF++) dataset [25], released in 2019. It is a large-scale forensics dataset consisting of 1000 real human face videos from 977 YouTube videos and 4000 fake face videos obtained with four face manipulation techniques: DeepFakes [6], Face2Face [45], FaceSwap [46], and NeuralTextures [47]. It is worth noting that AI-synthesized videos spread on social networks are usually compressed. These videos (RAW) are therefore compressed into two versions to simulate realistic forensic scenarios: medium compression (C23) and high compression (C40), using the H.264 codec with constant rate quantization parameters of 23 and 40, respectively. Because benchmark methods already achieve nearly perfect detection performance on the RAW version, our method targets the compressed versions, mainly C40.
We constructed 15 mutually exclusive datasets, each containing 1000 videos. Following the division of FF++, each dataset is split into train, validation, and test sets containing 720, 140, and 140 videos, respectively. For all videos, we crop the face region using the officially provided mask rather than MTCNN [48], which may crop non-artifact regions. We then save the aligned facial images as inputs of size 128 \(\times \) 128 pixels. By sampling each video at an interval of 30 frames, we obtain 15 datasets containing 30,000 images.
Implementation and hyper-parameters. In the experiments, we use the PyTorch [49] framework to implement our method, with ResNet-50 [35] as the backbone. We use the Adam optimizer [50] with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\epsilon = 10^{-8}\). The learning rate is initialized to 1 \(\times \)\(10^{-4}\) and the weight decay is 1 \(\times \)\(10^{-8}\). This experimental setup follows the baseline. During training, the batch size is set to 48. In every epoch, the model is validated 10 times, and the best parameters are saved according to validation accuracy. Early stopping is applied when the validation performance does not improve for 10 consecutive validations. Through extensive parameter experiments, we set the weight of the high-dimensional feature distribution learning module to 1, the weight of the high-frequency invariant feature learning module to 16, and the BCE loss weight to 1. In the following experiments, we use the accuracy score ACC (%) as our evaluation metric. We train our models on RTX 3070 GPUs.
Detection performance
Table 2
Experimental results (Acc%) of our proposed method and other eight different baseline approaches on four different deepfake datasets. The best results among all the methods are denoted in bold
The differences between face forgery datasets mainly lie in the variations of source videos and face manipulation methods. To evaluate the cross-manipulations generalization capability of different face forgery detectors and prevent the possible bias introduced by different source videos, we conduct experiments on FF++, as it provides fake videos created by multiple face forgery methods for the same source videos.
Table 3
Classification accuracy (%) of our training framework and the baseline (ADD-ResNet50) on FaceForensics++, for real images and fake images
Datasets      Models     C23 Real   C23 Fake   C40 Real   C40 Fake
DeepFakes     Baseline   93.98      95.93      89.34      94.56
              Ours       96.38      99.26      96.85      94.06
Face2Face     Baseline   90.88      95.02      87.26      88.68
              Ours       98.05      95.07      94.66      84.94
FaceSwap      Baseline   96.94      87.92      89.09      84.35
              Ours       98.43      94.56      96.58      84.72
NT            Baseline   83.84      89.33      84.41      68.21
              Ours       94.87      89.04      81.47      75.16
ALL           Baseline   93.98      95.93      89.34      94.56
              Ours       90.13      95.96      90.12      94.84
In this section, we compare our proposed method with current state-of-the-art face forgery detection methods. As shown in Table 2, on the FF++ dataset our method consistently achieves large improvements under different quality settings. Especially in the challenging C40 setting, the ACC score of our method exceeds ADD-ResNet50 [23] by 13.48%, and there is also a 3.22% improvement over the most recent work [57]. The explanation is that we use domain invariant features to portray the correct artifacts, which the network further learns through high-dimensional distribution-invariant feature learning, while the multi-scale and multi-angle design lets the network learn more artifact traces, improving low-quality fake face detection. We also find that our method yields more significant gains on the more challenging datasets, such as Face2Face and NeuralTextures. Because our network learns invariant features from the Raw and C40 data in an end-to-end manner, it is more reliable in learning artifacts and outperforms previous networks that rely on obvious artifact traces for feature learning. This demonstrates the effectiveness of our proposed method.
Table 4
The classification accuracy ACC (%) and area under the curve AUC (%) of our method and the baseline; the baseline results are reproduced by us
Datasets      Models     C23 ACC   C23 AUC   C40 ACC   C40 AUC
DeepFakes     Baseline   94.96     98.63     91.90     97.53
              Ours       97.82     99.84     95.47     99.23
Face2Face     Baseline   92.72     97.80     87.91     94.89
              Ours       96.74     98.67     90.25     96.08
FaceSwap      Baseline   92.89     97.40     86.95     93.68
              Ours       96.70     99.20     91.30     96.73
NT            Baseline   86.35     94.26     77.17     85.39
              Ours       92.26     96.72     78.65     87.04
ALL           Baseline   87.62     93.31     83.69     88.01
              Ours       93.00     97.57     84.95     89.40
For a more fine-grained analysis, we report the accuracy on real and fake images separately. From the baseline, we can clearly see that the accuracy on real images is usually lower than on fake images. This indicates that detection can be improved by increasing the discrimination rate on real images, and as shown in Table 3, our model does exactly that. This again illustrates that our model captures the common invariant artifact traces in both data streams well and achieves more accurate classification. As can be seen in Table 4, the method proposed in this paper also outperforms the baseline in terms of both ACC and AUC.
To illustrate the superiority of our proposed method, we also visualize the t-SNE [41] feature spaces of different data in FF++ (LQ) to explore the influence of our components on the distribution of the learned representations. We can observe from Fig. 9 that the features extracted by ADD-ResNet50 are compactly gathered in the t-SNE embedding space, which limits the discrimination of forged faces against real faces.
In particular, the features of NeuralTextures and Face2Face fake faces are entangled with those of real faces, because these methods only perform small-scale manipulation. After adding the domain invariant features extracted from the Raw and C40 domains by hf-IFLM and hd-FDLM, the distribution of the learned fusion representations changes: manipulated faces move farther from real faces and from other categories, representations of the same class are pulled together, and the distances between classes increase significantly. These distribution changes show that the common invariant artifacts captured by our method across domains help distinguish fake faces from real ones.
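A t-SNE embedding such as the one in Fig. 9 can be produced directly from penultimate-layer features; a minimal sketch with scikit-learn, using random features as a stand-in for the real extracted representations:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for penultimate-layer features of 200 samples (512-d),
# labeled as real plus four manipulation types.
features = rng.normal(size=(200, 512))
labels = rng.integers(0, 5, size=200)

# Project to 2-D for plotting; colour points by `labels` afterwards.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)
print(emb.shape)
```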
We visualize the classification accuracy of our method and the baseline on FF++. As shown in Fig. 10, our proposed method DIFLD outperforms the baseline ADD-ResNet50 model overall.
Ablation study and discussions
In this subsection, we investigate the feasibility of the two modules proposed in this paper, and the results are shown in Table 5.
Table 5
The effect of each module on the final results, experimented on the C40 NeuralTextures dataset

Method                 ACC (%)
ADD-ResNet50           68.53
W/o hf-IFLM module     81.16
W/o hd-FDLM module     80.56
Our method             82.01
NeuralTextures (NT) has proven to be the dataset most difficult to distinguish, both for human eyes and for DNNs. We therefore conducted the ablation study on the low-quality C40 NT dataset. We observe that the hf-IFLM module improves the accuracy by about 12.03%, and the hd-FDLM module by 12.63%. Finally, combining hf-IFLM and hd-FDLM raises the accuracy to 82.01%. The results of our ablation study suggest that each proposed module contributes differently to the framework's ability to learn domain invariant features from the Raw and C40 datasets, and that they are compatible when integrated for optimal performance.
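The per-module gains quoted above follow directly from the numbers in Table 5; as a quick arithmetic check (all values copied from the table):

```python
baseline = 68.53      # ADD-ResNet50
wo_hf_iflm = 81.16    # hd-FDLM only (hf-IFLM removed)
wo_hd_fdlm = 80.56    # hf-IFLM only (hd-FDLM removed)
full = 82.01          # both modules

hf_gain = round(wo_hd_fdlm - baseline, 2)   # gain from hf-IFLM alone
hd_gain = round(wo_hf_iflm - baseline, 2)   # gain from hd-FDLM alone
total_gain = round(full - baseline, 2)      # gain from both combined
print(hf_gain, hd_gain, total_gain)
```

Note that the combined gain of 13.48% matches the improvement over ADD-ResNet50 reported for the C40 setting earlier in this section.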
Conclusion
In this paper, using domain invariant features to characterize the correct causality, we propose a new deep forged face detection algorithm that improves robustness on low-quality images. Specifically, we first exploit the distribution-invariant property to extract domain knowledge and design a distribution-invariant feature extraction module to extract artifacts. We then propose a high-frequency enhancement module that exploits high-frequency information from the Raw dataset to strengthen detection on low-quality images. Experimental results on the FaceForensics++ dataset, especially on the C40 compressed data, show that the algorithm outperforms other state-of-the-art algorithms.
DIFLD ensures the comprehensiveness of the model when detecting artifacts and thus improves detection capability, but in doing so it reduces the generalization capability to some extent. Improving generalization is therefore an issue to be addressed in future work. In addition, since all state-of-the-art algorithms, including the one proposed in this paper, focus on detecting plaintext deep forgery videos, we will explore the detection of encrypted deep forgery videos to protect privacy.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (Grant Number 61971078) and Chongqing University of Technology Graduate Education Quality Development Action Plan Funding Results (Grant Number gzlcx20223206), which provided domain expertise and computational power that greatly assisted the activity.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.