With the rapid development of deep learning, face forgery detection methods have achieved remarkable progress. However, most methods suffer significant performance degradation on low-quality compressed face images. This is because: (a) image artifacts are blurred during compression, so the model learns insufficient artifact traces; and (b) low-quality images introduce considerable noise, and minimizing the training error drives the model to absorb all correlations in the training set indiscriminately, leading to over-fitting. To address these problems, we learn domain invariant representations that inscribe the correct relevance, i.e., artifacts, to improve robustness on low-quality images. Specifically, we propose a novel face forgery detector, called DIFLD, with two components: (1) a high-frequency invariant feature learning module (hf-IFLM), which effectively recovers the blurred artifacts in low-quality compressed images; and (2) a high-dimensional feature distribution learning module (hd-FDLM), which guides the network to learn distribution-consistent features. With these two modules, the whole framework learns more discriminative and correct artifact features in an end-to-end manner. Extensive experiments show that our proposed method is more robust to image-quality variations, especially on low-quality images, and achieves a 3.67% improvement over state-of-the-art methods on the challenging NeuralTextures dataset.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Introduction
Human faces play an important role in our daily life, for example in access control [1] and face payment [2]. In recent years, however, deep learning techniques [3‐5] have been misused to produce forged faces, leading to the proliferation of faked videos and images on the Internet, represented by “Deepfakes” [6]. By tampering with or replacing the face information of original videos, these deep forgery technologies reduce or even distort the authenticity of the information we see online. This not only affects companies and celebrities but also poses a great threat to the lives and work of ordinary people. Figure 1 shows some faked face images generated by tampering with real images using deep forgery techniques: (A) and (C) show the former US President Barack Obama, created by BuzzFeed in collaboration with Monkeypaw Studios; (B) and (D) are frames from faked videos of ordinary people. These generated images are nearly free of forgery traces, and it is difficult for humans to determine their authenticity with the naked eye. Therefore, it is increasingly urgent to develop effective detection methods.
Toward this concern, many methods [7‐15] have been proposed. Traditional methods focus on designing non-learning algorithms to extract forged features from images and thereby discover differences between real and tampered faces. These features can be divided into color features [10, 11], biological features [12, 13], and other discriminative features. Ultimately, the handcrafted features are fed into machine learning algorithms for classification. Although traditional detection methods are well established, they are limited to small- and medium-scale datasets: in today’s era of exploding data volumes, they lose their efficiency advantage. Moreover, with the development of deep forgery techniques, fake faces are becoming more and more realistic, which presents a great challenge [16] to traditional methods based on specific artifacts.
To capture more subtle differences between real and fake images, later works use deep convolutional neural networks (CNNs) [17] to learn discriminative features from training datasets for face forgery detection. Using an off-the-shelf CNN backbone, these methods take facial images directly as input and classify them as real or fake. These vanilla CNNs, however, tend to look for forgeries in a limited region of faces, indicating that the detectors lack an understanding of forgery [18]. Since then, mainstream research has gradually shifted to improving face forgery detection with specific prior knowledge, by optimizing network structures [14, 19‐21] or designing corresponding end-to-end learning frameworks [15, 22, 23]. Though recent works have achieved sound results, low-quality datasets can easily cause existing methods to fail. That is because: (a) they rely on forgery patterns possessed by the particular manipulation techniques present in the training datasets, but low-quality images blur out these artifacts; and (b) image compression introduces a great deal of extraneous noise, which current methods do not account for. Thus, in the real world, available methods usually fall short of the desired results, because forged face images shared on social media are usually compressed.
As shown in Table 1, extensive comparative experiments reveal that low-quality (LQ) face forgery detection is prone to over-fitting. Inspired by the literature [24], we attribute this problem to the fact that minimizing the training error leads the model to learn all correlations in the training data, regardless of the consequences. LQ images blur the boundaries between true and false artifacts, making the model more susceptible to selection bias and confounding factors. As a consequence, the model not only fails to learn sufficient correct causality but also learns spurious correlations, i.e., non-artifacts. To address these issues, we propose using domain invariant features to inscribe the correct causality, mitigate the model’s over-reliance on data bias, alleviate over-fitting, and improve fake face detection on LQ images. Specifically, for the intermediate feature layers we design high-frequency invariant feature learning modules, and for the high-dimensional feature layer we design a high-dimensional feature distribution invariance loss. Together, the two modules enable the model to learn the correct artifact traces in an end-to-end manner.
Table 1
Comparative experimental results of training and testing on the NeuralTextures RAW (high-quality) and C40 (low-quality) datasets from FaceForensics++ [25], reported as accuracy (Acc)

NeuralTextures    C40       RAW
Train dataset     98.50%    99.50%
Test dataset      79.08%    97.12%
With the above considerations in mind, in this paper, we present a novel face forgery detector framework. Our contributions in this paper are summarized as follows:
This paper is the first to use domain invariance to inscribe the correct artifacts and alleviate the over-fitting problem on low-quality compressed images.
The hf-IFLM and hd-FDLM modules are proposed to learn domain invariant feature representation. They correspond to high-frequency invariant feature learning in the middle layer and high-dimensional feature distribution learning in the output layer, respectively.
Compared to commonly used frequency-domain analysis, we use the Haar wavelet transform to exploit both spatial and frequency information, and we learn the common invariant feature representation more fully by designing a multi-scale, multi-angle learning module.
The remainder of this paper is organized as follows: Section Related work reviews some preliminaries of the proposed algorithm. Section Methods shows the architecture of the DIFLD and proposed algorithm in detail. Section Experimental presents the comparison and ablation experimental results and analysis. Finally, this paper is summarized in Sect. Conclusion.
Related work
Face forgery detection. Nowadays, face forgery detection has drawn significant attention, as it is related to the protection of personal privacy. So far, many face forgery detection methods have been proposed by academia and industry, contributing greatly to research in this direction. For example, [13] proposed to judge real and fake images by the shape of the pupils: a person’s pupils are generally elliptical, while the authors find that faces generated using techniques such as GANs [3] have irregular pupil shapes. In [26], blink frequency was used for fake face detection. However, these methods [10‐13, 27, 28] require extracting handcrafted features, which is not only time-consuming and labor-intensive but also very challenging. Later, with the development of deep learning, forged face detection research began to shift to deep learning methods [14, 29], aiming to achieve more flexible and reliable detection through dynamic feature learning.
For example, Shreyan Ganguly [14] proposes a Vision Transformer with Xception Network (ViXNet), which learns the consistency of the almost imperceptible artifacts left by face forgery methods across the entire facial region. Nevertheless, such methods are insufficient to cope with low-quality images and have limited practical value. To counter this problem, various kinds of additional information are used to enhance performance, most of them based on the spatial domain, such as RGB and HSV. For example, some approaches [12, 13] exploit specific artifacts arising from the synthesis process, such as color or shape cues. Wang [29] proposes a multimodal contrastive classification local correlation representation (MC-LCR) framework for effective face forgery detection. Instead of specific appearance features, MC-LCR amplifies the implicit local differences between real and forged faces in both the spatial and frequency domains. Based on the complementary nature of amplitude and phase information, they develop a patch-wise amplitude and phase dual attention module to capture locally relevant inconsistencies in the frequency domain. But the effectiveness of these methods is limited to the datasets on which they are specially trained. To be more resilient, we turn to learning domain invariant features to inscribe causality, limiting the network’s learning of spurious artifacts so as to improve detection accuracy on low-quality images.
Domain generalization. Traditional machine learning (ML) models are trained under the i.i.d. assumption that training and testing sets are identically and independently distributed. However, this assumption does not always hold in reality. When the probability distributions of training and testing data differ, the performance of ML models often deteriorates due to domain distribution gaps [30]. Collecting data from all possible domains to train ML models is expensive and often simply impossible. Therefore, enhancing the generalization ability of ML models is important in both industry and academia. The goal of domain generalization is to learn a model from one or several different but related domains (i.e., diverse training datasets) that generalizes well to unseen testing domains, that is, a model with minimum prediction error on images from the testing domains.
Most domain generalization methods belong to domain feature alignment schemes, whose core idea is to learn domain invariant representations so as to minimize the differences between source domains. In this work, due to the difference in resolution, we treat the Raw dataset and the C40 dataset as two different domains, which serve as our training datasets, as shown in Fig. 2. We learn domain invariant features via invariant risk minimization [24]. The goal of representation learning can be formulated as Eq. 1:

\(\min _{f,g} \; {\mathbb {E}}\left [\ell \left (f(g(x)), y\right )\right ] + \lambda \, \ell _{reg}(g)\)

where g is a representation learning function, f is the classifier function, \(\ell _{reg}\) denotes a regularization term, \(\lambda \) is the trade-off parameter, \({\mathbb {E}}\) represents expectation, and \(\ell (\cdot )\) is the classification loss function.
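As an illustrative sketch (not the authors' implementation), the Eq. 1 objective for one batch could be evaluated as follows, with cross-entropy standing in for \(\ell \) and `lam` a hypothetical trade-off value:

```python
import numpy as np

def representation_objective(logits, labels, reg_value, lam=0.1):
    """Toy evaluation of the Eq. 1 objective: expected classification
    loss of f(g(x)) plus a weighted regularization term.
    `lam` (the trade-off parameter) is an illustrative value."""
    # cross-entropy as the classification loss \ell
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels]).mean()
    return ce + lam * reg_value
```

Here `logits` would come from the classifier f applied to the learned representation g(x), and `reg_value` from whichever regularizer \(\ell _{reg}\) is chosen.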
Wavelet transform. The wavelet transform (WT) [31, 32] is a transform analysis method that contrasts with the Fourier transform [33, 34], in which a signal is expressed as a sum of sines and cosines. The main difference is that wavelets are localized in both the time and frequency domains, while the standard Fourier transform is localized only in the frequency domain. The major strengths of the WT are its ability to highlight particular aspects of a problem through transformation, to localize analysis in time and spatial frequency, and to refine signals at multiple scales through dilation and translation operations.
Overall, the WT is a multi-scale, multi-resolution decomposition of an image that can focus on arbitrary details of the image, and it is therefore known as a mathematical microscope. Based on this microscopic property, this paper uses the WT to focus on the forgery details in an image.
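To make the decomposition concrete, a minimal single-level 2-D Haar transform can be sketched in NumPy (the subband naming below follows one common convention; signs and normalization vary across libraries):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar transform of an (H, W) array with even sides.
    Returns (LL, LH, HL, HH): the low-frequency approximation plus the
    horizontal, vertical, and diagonal high-frequency subbands."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0   # horizontal detail
    hl = (a - b + c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh
```

With this normalization the transform is orthonormal, so the subbands preserve the energy of the input; applying it again to `ll` gives the next, coarser scale, which is exactly the multi-scale refinement described above.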
Methods
In this work, we propose a deep-learning-based classification model, called DIFLD, which discriminates manipulated facial images generated by different deep face forgery techniques from real ones. The working principle of the proposed model is summarized pictorially in Fig. 4. It comprises two main modules: (1) hf-IFLM, described in Sect. hf-IFLM, and (2) hd-FDLM, described in Sect. hd-FDLM.
Overview
As shown in Fig. 4, in order to improve the detection accuracy of low-quality forged images, we propose two cooperative modules that learn domain-invariant features from two different perspectives: hf-IFLM and hd-FDLM. The Raw and C40 datasets mentioned in this paper are, respectively, the original uncompressed version and the highly compressed version of the FaceForensics++ dataset. From the perspective of learning more artifact traces, we design the hf-IFLM module. By forwarding a C40 compressed input image and its corresponding Raw image through the backbone network, we extract the truncated feature maps \({\widetilde{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\) and \({\widehat{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\) from Stage i of ResNet-50 [35]. For convenience, we use the notation \({\varvec{X_i}}\) to denote the pair \({\widetilde{X}}_i(c,w,h)\) and \({\widehat{X}}_i(c,w,h)\). \({\varvec{X_i}}\) is then passed to the Subtra module, in which multi-angle high-frequency information is used to compensate for the features \(Loss\_i\) discarded during compression. From the perspective of learning the correct artifact traces and eliminating non-forgery traces, we design the hd-FDLM module: we obtain two distributions \(P_{raw}\) and \(P_{c40}\) by computing the distributions of the backbone output-layer feature maps \({\widetilde{F}}_{hd}\) and \({\widehat{F}}_{hd}\), i.e., \({\varvec{F\_hd}}\). Then, the distribution distance \({\mathcal {L}}_{hd}\) is calculated to constrain the similarity between Raw and C40. In this way, the network learns more distribution-consistency information, which is used to portray the correct artifact traces and to remove noise introduced during training.
The representations (\(Loss\_1\), \( Loss\_2\), \(Loss\_3\), \( Loss\_4\), \({\varvec{F\_{hd}}}\)) extracted from the different modules are learned in an end-to-end manner by designing loss functions, where the training process is guided by joint losses \({\mathcal {L}}_{cls}\), \({\mathcal {L}}_{hf}\) and \({\mathcal {L}}_{hd}\). The overall learning of DIFLD is given in the corresponding Algorithm 1.
hf-IFLM
Motivation. In the literature [11, 13], face forgery detection methods usually use spatial domain features. However, frequency domain features are equally important. In [36], the authors showed that a spectrum-based classifier performs better than a pixel-based classifier in detecting forged images: they designed a GAN to synthesize artifacts and added up-sampling components to the GAN to detect these artifacts. Applications of frequency domain features can also be found in other authentication scenarios, such as smartphone user verification. In [37], the authors design a dual-stream network that uses frequency domain features to distinguish legitimate users from impostors on smartphones. Similarly, in [38] the authors design a frequency-based approach to verify smartphone security. Inspired by these successful uses of frequency domain features, we propose to use frequency information to detect faked faces.
As shown in Fig. 3, the synthetic face region (in the red box) and the background region present obviously different distributions in the high-frequency maps, indicating that high-frequency information can capture forgery traces very well [23, 39]. Moreover, as shown in the last row of Fig. 3, the artifacts of generated facial images become blurred when the images are compressed, which prevents the network from learning them. In the top row, by contrast, the traces of forgery are clearly visible. Thus, to compensate for the deficiency of low-quality (C40) images, we attempt to exploit artifacts with high-frequency information from high-quality (Raw) images.
It is well known that the shallow layers of DNNs tend to capture low-level features (e.g., color and texture), while the deeper layers tend to capture high-level abstract features [29]. Therefore, in order to extract more sufficient forgery traces, we perform high-frequency artifact extraction on Stages 1–4 of our backbone network. Besides, as shown in Fig. 3, the high-frequency information in the horizontal, vertical, and diagonal directions can all capture artifacts well, so we use all three directions for feature extraction simultaneously. In summary, we design a multi-scale and multi-angle feature extraction module for high-frequency information, as shown in the top stream of Fig. 4.
In order to convert spatial domain information into frequency domain information, we transform the outputs of Stages 1–4 of ResNet-50. Compared with the conventional discrete Fourier transform (DFT) [32], our work uses the Haar wavelet transform (WT) [33], because the WT replaces the infinite-length trigonometric basis with a finite-length, decaying wavelet basis. This allows the network to capture not only frequency information but also spatial information.
Design of hf-IFLM. First, by forwarding a C40 compressed input image and its corresponding Raw image through the backbone network, we obtain \({\widetilde{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\), the Stage-i feature map of the Raw stream, and \({\widehat{X}}_i(c,w,h) \in {\mathbb {R}}^{C \times W \times H}\), the corresponding feature map of the C40 stream; each has C channels, width W, and height H. These backbone outputs are decomposed with the wavelet transform in Eq. 2. Figure 5 visualizes the Haar wavelet decomposition of a two-dimensional image.

\({\mathcal {H}}_{{\bar{\chi }}_i}(c,a,b) = \Im \left ({\bar{\chi }}_i(c,w,h)\right )\)

where c, w, and h denote the \(c_{th}\), \(w_{th}\), and \(h_{th}\) slice in the channel, width, and height dimensions of \({\widetilde{X}}_i(c,w,h) \) and \({\widehat{X}}_i(c,w,h)\), respectively. \(\Im (\cdot )\) denotes the wavelet transform. \({\mathcal {H}}_{{\bar{\chi }}_i}(c,a,b)\) is composed of the horizontal high-frequency information \({{\mathcal {H}}}_{{\bar{\chi }}_i}(c,a,b)_H\), the vertical high-frequency information \({{\mathcal {H}}}_{{\bar{\chi }}_i}(c,a,b)_V\), and the diagonal high-frequency information \({{\mathcal {H}}}_{{\bar{\chi }}_i}(c,a,b)_D\). \({\bar{\chi }}_i\) stands for the corresponding \({\widetilde{X}}_i\) or \({\widehat{X}}_i\), and a and b are equal to half of w and h, respectively.
After obtaining the high-frequency information of the two datasets with different resolutions, as shown in Fig. 6, we take the vertical direction of Stage i (\(i=1,2,3,4\)) as an example. We assume that the high-frequency difference can offset the artifact loss that compression causes in the C40 data. It is computed as follows:

\({\mathcal {D}}^H_i(c,a,b) = {{\mathcal {H}}}_{{\widetilde{X}}_i}(c,a,b)_V - {{\mathcal {H}}}_{{\widehat{X}}_i}(c,a,b)_V\)
In order to make the difference matrix \({\mathcal {D}}^H_i(c,a,b)\) work better, we design an attention weight \(\omega (a,b)\) based on how humans recognize objects with the naked eye: the surrounding pixels influence the recognition of the current pixel. Therefore, as calculated in Eq. 4, we use the cross-channel difference indices of the four neighboring pixels as weights.
Here \(\gamma _{hf}\) is the mean value of the four adjacent pixels. This design of attention weights ensures that the model pays more attention to the lost high-frequency information.
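As a minimal sketch, under the assumption that the difference matrix is a plain subtraction of matching subbands (before the attention weighting, which is omitted here), the vertical-direction computation for one stage could look like:

```python
import numpy as np

def vertical_highfreq(x):
    """Vertical Haar detail subband of a (C, W, H) feature map,
    computed per channel; output shape is (C, W//2, H//2)."""
    a = x[:, 0::2, 0::2]
    b = x[:, 0::2, 1::2]
    c = x[:, 1::2, 0::2]
    d = x[:, 1::2, 1::2]
    return (a - b + c - d) / 2.0

def highfreq_difference(x_raw, x_c40):
    """Difference matrix between the vertical subbands of the Raw and
    C40 feature maps taken from the same backbone stage (a sketch of
    Eq. 3, without the attention weights of Eq. 4)."""
    return vertical_highfreq(x_raw) - vertical_highfreq(x_c40)
```

The horizontal and diagonal directions would follow the same pattern with their respective sign combinations.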
Finally, we realize multi-scale and multi-angle invariant feature learning by accumulating the losses of Stages 1–4 over the three directions, as in Eq. 5.
Here, \({\varvec{\delta }}\) and \({\varvec{\eta }}\) are vector weight parameters over the different angles and scales, respectively.
The whole module is trained in an end-to-end way. By minimizing \( {\mathcal {L}}_{hf}\) over the Raw and C40 datasets, the network learns more high-frequency domain invariant features to cope with low-quality forged face detection.
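The multi-scale, multi-angle accumulation of Eq. 5 reduces to a weighted double sum over stages and directions; a sketch with illustrative (not the paper's) default weights:

```python
def hf_invariance_loss(stage_losses, delta=(1.0, 1.0, 1.0), eta=(1.0, 1.0, 1.0, 1.0)):
    """stage_losses[i][k]: per-direction loss of Stage i+1
    (k = 0/1/2 for horizontal/vertical/diagonal). `delta` weights the
    directions and `eta` the stages; both defaults are illustrative."""
    total = 0.0
    for i, per_dir in enumerate(stage_losses):
        total += eta[i] * sum(d * l for d, l in zip(delta, per_dir))
    return total
```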
hd-FDLM
Motivation. In [40], the authors propose a novel distribution distillation loss to narrow the performance gap between easy and hard samples. The main idea is to construct two similarity distributions, a teacher distribution from easy samples and a student distribution from hard samples, and to design a distribution-driven loss that constrains the student distribution to approximate the teacher distribution, which leads to a smaller overlap between the positive and negative pairs in the student distribution.
In this paper, we utilize a similar idea for fake face detection of low-quality images. As shown in Fig. 7, we find a significant difference in the distribution of Raw and C40 images after the backbone. Therefore, we intend to use two different feature distribution information to learn domain invariant features, thus reducing the overlap of real and false artifacts in the low-quality distribution, enabling the model to learn the correct correlation.
In order to construct the probability distributions of the Raw and C40 datasets, we use t-SNE [41] to represent the distribution of features between Raw and C40, because t-SNE not only converts the data into a low-dimensional distribution that is easier to manipulate but also preserves the local characteristics of the data: points that were close together remain close after conversion, and points that were far apart remain far apart.
Design of hd-FDLM. Since high-dimensional features [9] capture artifact traces with more subtlety, they facilitate our detection. As shown in Fig. 8, by forwarding a C40 compressed input image and its corresponding Raw image through the backbone network, we obtain the features \({\widetilde{F}}_{hd}\) and \({\widehat{F}}_{hd}\) from the high-dimensional layer, correspondingly. Then, after passing \({\widetilde{F}}_{hd}\) and \({\widehat{F}}_{hd}\) through global average pooling and fully connected layers, the Sigmoid function [42] is computed for each channel, yielding one Sigmoid value per convolved channel.
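A simplified sketch of this per-channel pooling step, omitting the fully connected layer for brevity and using NumPy in place of the actual PyTorch layers:

```python
import numpy as np

def channel_sigmoid(feat):
    """Global average pooling over the spatial dims of a (C, W, H)
    feature map, followed by an element-wise sigmoid: one value per
    channel. (The intermediate fully connected layer is omitted.)"""
    pooled = feat.mean(axis=(1, 2))        # (C,)
    return 1.0 / (1.0 + np.exp(-pooled))   # sigmoid per channel
```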
The t-SNE similarity distribution is computed as Eq. 6:
Here, \({\mathcal {S}}\left( x_{i}, x_{j}\right) \) is the similarity between data points i and j: the closer the distance, the more similar they are. In this paper, we use the Euclidean distance [43] to calculate the similarity between feature maps, as shown in Eq. 7. Given n data points, we define n probability distributions, one for each.
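A sketch of how such row-wise similarity distributions might be built from Euclidean distances (a softmax over negative distances, in the spirit of t-SNE affinities; the exact kernel used in the paper may differ):

```python
import numpy as np

def similarity_distributions(feats):
    """For a batch of n feature vectors, return an (n, n) matrix whose
    i-th row is a probability distribution over the other points:
    similarity decays with Euclidean distance."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self-similarity
    sim = np.exp(-d)
    return sim / sim.sum(axis=1, keepdims=True)
```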
After obtaining the distributions of the two data streams, we denote the distribution of the Raw dataset as \(P_{raw}\) and that of the C40 dataset as \(P_{c40}\). To narrow the performance gap between the C40 and Raw datasets, we constrain the similarity distribution of C40 samples to approximate that of Raw samples. Motivated by previous uses of the Kullback–Leibler (KL) divergence [40, 44], we adopt the KL divergence to constrain the similarity between the Raw and C40 distributions, which is defined as follows:

\({\mathcal {L}}_{hd} = \sum _{i=1}^{n} P_{raw}(i)\, \log \frac{P_{raw}(i)}{P_{c40}(i)}\)
where n denotes the length of one batch. Ultimately, the whole framework exploits the common features in the distribution information of the different data streams. Our method incorporates a novel distribution loss that lets the network learn more of the correct artifacts.
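The KL constraint between the two batch-level distributions can be sketched as follows (the `eps` clipping is our addition for numerical safety, not part of the paper):

```python
import numpy as np

def kl_loss(p_raw, p_c40, eps=1e-12):
    """KL(P_raw || P_c40) summed over a batch of row distributions,
    pulling the C40 similarity distribution toward the Raw one."""
    p = np.clip(p_raw, eps, 1.0)
    q = np.clip(p_c40, eps, 1.0)
    return np.sum(p * np.log(p / q))
```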
The entire network is trained in an end-to-end manner by jointly minimizing the classification loss \({\mathcal {L}}_{cls}\), the high-frequency loss \({\mathcal {L}}_{hf}\), and the distribution-learning loss \({\mathcal {L}}_{hd}\). The two modules synergistically complement the domain invariant features by minimizing the training error of the Raw and C40 data streams, allowing the network to learn more, and more correct, forgery traces and thereby improve forged face detection.
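The joint objective then reduces to a weighted sum; the default weights below mirror those reported in the experiments section (classification 1, hf-IFLM 16, hd-FDLM 1), but are otherwise illustrative:

```python
def total_loss(l_cls, l_hf, l_hd, w_cls=1.0, w_hf=16.0, w_hd=1.0):
    """Joint training objective: weighted sum of the classification,
    high-frequency, and distribution-learning losses."""
    return w_cls * l_cls + w_hf * l_hf + w_hd * l_hd
```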
Experimental
In this section, the performance of our method is evaluated. In Section Experiment settings, we first introduce the experimental datasets and describe how the training and testing sets are prepared from them, and then discuss the experimental setup. In Section Detection performance, we present extensive experimental results compared with state-of-the-art (SOTA) methods to demonstrate the superiority of the proposed method. Finally, in Section Ablation study and discussions, the effectiveness of the hf-IFLM and hd-FDLM modules is verified by ablation experiments.
Experiment settings
Datasets. Following previous face forgery detection methods [11, 14, 15, 19‐23], we conduct our experiments on the challenging FaceForensics++ (FF++) dataset [25], released in 2019. It is a large-scale forensics dataset consisting of 1000 real human face videos from 977 YouTube videos and 4000 fake face videos obtained with four face manipulation techniques: DeepFakes [6], Face2Face [45], FaceSwap [46], and NeuralTextures [47]. It is worth noting that AI-synthesized videos spread on social networks are usually compressed. These videos (RAW) are therefore compressed into two versions to simulate realistic forensic scenarios: medium compression (C23) and high compression (C40), using the H.264 codec with constant rate quantization parameters of 23 and 40, respectively. Because benchmark methods already achieve nearly perfect detection performance on the RAW version, our method targets the compressed versions, mainly C40.
We constructed 15 mutually exclusive datasets, each containing 1000 videos. Following the division of FF++, each dataset is split into train, validation, and test sets containing 720, 140, and 140 videos, respectively. For all videos, we crop the face region using the officially provided mask rather than MTCNN [48], which may crop non-artifact regions. We then save the aligned facial images as inputs of size 128 \(\times \) 128 pixels. By sampling each video at an interval of 30 frames, we obtain 15 datasets containing 30,000 images.
Implementation and hyper-parameters. In the experiments, we use the PyTorch [49] framework to implement our method, with ResNet-50 [35] as the backbone. We use the Adam optimizer [50] with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\epsilon = 10^{-8}\). The learning rate is initialized to 1 \(\times \)\(10^{-4}\) and the weight decay is 1 \(\times \)\(10^{-8}\). This experimental setup follows the baseline. During training, the batch size is set to 48. In every epoch, the model is validated 10 times, and the best parameters are saved according to validation accuracy. Early stopping is applied when the validation performance does not improve for 10 consecutive validations. Through extensive parameter experiments, we set the weight of the high-dimensional feature distribution learning module to 1, the weight of the high-frequency invariant feature learning module to 16, and the BCE loss weight to 1. In the following experiments, we use the accuracy score ACC (%) as our evaluation metric. We train our models on RTX 3070 GPUs.
Detection performance
Table 2
Experimental results (Acc%) of our proposed method and other eight different baseline approaches on four different deepfake datasets. The best results among all the methods are denoted in bold
The differences between face forgery datasets mainly lie in the variations of source videos and face manipulation methods. To evaluate the cross-manipulations generalization capability of different face forgery detectors and prevent the possible bias introduced by different source videos, we conduct experiments on FF++, as it provides fake videos created by multiple face forgery methods for the same source videos.
Table 3
Classification accuracy (%) of our training framework and the baseline (ADD-ResNet50) on FaceForensics++, for real images and fake images
Datasets      Models     C23 Real   C23 Fake   C40 Real   C40 Fake
DeepFakes     Baseline   93.98      95.93      89.34      94.56
              Ours       96.38      99.26      96.85      94.06
Face2Face     Baseline   90.88      95.02      87.26      88.68
              Ours       98.05      95.07      94.66      84.94
FaceSwap      Baseline   96.94      87.92      89.09      84.35
              Ours       98.43      94.56      96.58      84.72
NT            Baseline   83.84      89.33      84.41      68.21
              Ours       94.87      89.04      81.47      75.16
ALL           Baseline   93.98      95.93      89.34      94.56
              Ours       90.13      95.96      90.12      94.84
In this section, we compare our proposed method with current state-of-the-art face forgery detection methods. As shown in Table 2, on the FF++ dataset our method consistently achieves large improvements under different quality settings. Especially in the challenging C40 setting, the ACC score of our method exceeds ADD-ResNet50 [23] by 13.48%, and there is also a 3.22% improvement over the most recent work [57]. The explanation is that we use domain invariant features to portray the correct artifacts, which the network further learns through high-dimensional distribution-invariant feature learning, while the multi-scale and multi-angle design lets the network learn more artifact traces, improving low-quality fake face detection. We also find that our method yields more significant gains on the more challenging datasets, such as Face2Face and NeuralTextures. Because our network learns invariant features from the Raw and C40 data in an end-to-end manner, it is more reliable in learning artifacts and outperforms previous networks that rely on obvious artifact traces for feature learning. This demonstrates the effectiveness of our proposed method.
Table 4
The classification accuracy ACC (%) and area under the curve AUC (%) of our method and the baseline; the baseline results are reproduced by us
Datasets      Models     C23 ACC   C23 AUC   C40 ACC   C40 AUC
DeepFakes     Baseline   94.96     98.63     91.90     97.53
              Ours       97.82     99.84     95.47     99.23
Face2Face     Baseline   92.72     97.80     87.91     94.89
              Ours       96.74     98.67     90.25     96.08
FaceSwap      Baseline   92.89     97.40     86.95     93.68
              Ours       96.70     99.20     91.30     96.73
NT            Baseline   86.35     94.26     77.17     85.39
              Ours       92.26     96.72     78.65     87.04
ALL           Baseline   87.62     93.31     83.69     88.01
              Ours       93.00     97.57     84.95     89.40
For a more fine-grained analysis, we report the accuracy on real and fake images separately. From the baseline, we can clearly see that the accuracy on real images is usually lower than on fake images. This indicates that detection can be improved by increasing the discrimination rate on real images, and as shown in Table 3, our model does exactly that. This again illustrates that our model captures the common invariant artifact traces in both data streams well and achieves more accurate classification. As can be seen in Table 4, the method proposed in this paper also outperforms the baseline in terms of both ACC and AUC.
To illustrate the superiority of our proposed method, we also visualize the t-SNE [41] feature spaces of different data in FF++ (LQ) to explore the influence of our components on the distribution of the learned representations. We can observe from Fig. 9 that the features extracted by ADD-ResNet50 are compactly gathered in the t-SNE embedding space, which limits the discrimination of forged faces against real faces.
In particular, the features of NeuralTextures and Face2Face fake faces are entangled with those of real faces, because these methods only perform small-scale manipulation. After adding the domain invariant features extracted from the Raw and C40 domains by hf-IFLM and hd-FDLM, the distribution of the learned fusion representations changes: manipulated faces move farther from real faces and from other categories, representations of the same class are pulled together, and the distances between classes increase significantly. These distribution changes show that the common invariant artifacts captured by our method across domains help distinguish fake faces from real ones.
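A t-SNE embedding such as the one in Fig. 9 can be produced directly from penultimate-layer features; a minimal sketch with scikit-learn, using random features as a stand-in for the real extracted representations:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for penultimate-layer features of 200 samples (512-d),
# labeled as real plus four manipulation types.
features = rng.normal(size=(200, 512))
labels = rng.integers(0, 5, size=200)

# Project to 2-D for plotting; colour points by `labels` afterwards.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)
print(emb.shape)
```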
We visualize the classification accuracy of our method and the baseline on FF++. As shown in Fig. 10, our proposed method DIFLD outperforms the baseline ADD-ResNet50 model overall.
Ablation study and discussions
In this subsection, we investigate the feasibility of the two modules proposed in this paper, and the results are shown in Table 5.
Table 5
The effect of each module on the final results, experimented on the C40 NeuralTextures dataset

Method                 ACC (%)
ADD-ResNet50           68.53
W/o hf-IFLM module     81.16
W/o hd-FDLM module     80.56
Our method             82.01
NeuralTextures (NT) has proven to be the dataset most difficult to distinguish, both for human eyes and for DNNs. We therefore conducted the ablation study on the low-quality C40 NT dataset. We observe that the hf-IFLM module improves the accuracy by about 12.03%, and the hd-FDLM module by 12.63%. Finally, combining hf-IFLM and hd-FDLM raises the accuracy to 82.01%. The results of our ablation study suggest that each proposed module contributes differently to the framework's ability to learn domain invariant features from the Raw and C40 datasets, and that they are compatible when integrated for optimal performance.
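The per-module gains quoted above follow directly from the numbers in Table 5; as a quick arithmetic check (all values copied from the table):

```python
baseline = 68.53      # ADD-ResNet50
wo_hf_iflm = 81.16    # hd-FDLM only (hf-IFLM removed)
wo_hd_fdlm = 80.56    # hf-IFLM only (hd-FDLM removed)
full = 82.01          # both modules

hf_gain = round(wo_hd_fdlm - baseline, 2)   # gain from hf-IFLM alone
hd_gain = round(wo_hf_iflm - baseline, 2)   # gain from hd-FDLM alone
total_gain = round(full - baseline, 2)      # gain from both combined
print(hf_gain, hd_gain, total_gain)
```

Note that the combined gain of 13.48% matches the improvement over ADD-ResNet50 reported for the C40 setting earlier in this section.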
Conclusion
In this paper, using domain invariant features to characterize the correct causality, we propose a new deep forged face detection algorithm that improves robustness on low-quality images. Specifically, we first exploit the distribution-invariant property to extract domain knowledge and design a distribution-invariant feature extraction module to extract artifacts. We then propose a high-frequency enhancement module that exploits high-frequency information from the Raw dataset to strengthen detection on low-quality images. Experimental results on the FaceForensics++ dataset, especially on the C40 compressed data, show that the algorithm outperforms other state-of-the-art algorithms.
DIFLD ensures the comprehensiveness of the model when detecting artifacts and thus improves detection capability, but in doing so it reduces the generalization capability to some extent. Improving generalization is therefore an issue to be addressed in future work. In addition, since all state-of-the-art algorithms, including the one proposed in this paper, focus on detecting plaintext deep forgery videos, we will explore the detection of encrypted deep forgery videos to protect privacy.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China (Grant Number 61971078) and Chongqing University of Technology Graduate Education Quality Development Action Plan Funding Results (Grant Number gzlcx20223206), which provided domain expertise and computational power that greatly assisted the activity.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.