Local Region Frequency Guided Dynamic Inconsistency Network for Deepfake Video Detection

Pengfei Yue; Beijing Chen; Zhangjie Fu

doi:10.26599/BDMA.2024.9020030

Open Access

Engineering Research Center of Digital Forensics affiliated with Ministry of Education, and also with School of Computer Science, and also with Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China

Abstract

With the rapid development of deepfake technology in recent years, large numbers of deepfake videos have appeared on the Internet, posing a serious threat to national politics, social stability, and personal privacy. Many existing deepfake detection methods perform well on known manipulations, but their detection capability drops when they face unknown manipulations. To improve generalization, this paper analyzes global and local inter-frame dynamic inconsistencies in both the spatial and frequency domains and proposes a Local region Frequency Guided Dynamic Inconsistency Network (LFGDIN). The network has two parts: a Global SpatioTemporal Network (GSTN) and a Local Region Frequency Guided Module (LRFGM). The GSTN captures the dynamic information of the entire face, while the LRFGM extracts the frequency dynamic information of the eyes and mouth. Through local region alignment, the LRFGM guides the GSTN to concentrate on dynamic inconsistency in significant local regions, which improves the model's detection performance. Experiments on three public datasets (FF++, DFDC, and Celeb-DF) show that the proposed method outperforms many recent advanced methods when detecting deepfake videos of unknown manipulation types.

Keywords

deepfake video detection; dynamic inconsistency; local region; local region frequency

References

[1] M. Tora, Deepfakes, https://github.com/deepfakes/faceswap/tree/v2.0.0, 2018.

[2] K. Liu, I. Perov, D. Gao, N. Chervoniy, W. Zhou, and W. Zhang, DeepFaceLab: Integrated, flexible and extensible face-swapping framework, Pattern Recognition, vol. 141, p. 109628, 2023.

[3] M. Kowalski, FaceSwap, https://github.com/marekkowalski/faceswap, 2018.

[4] H. Lin, W. Huang, W. Luo, and W. Lu, Deepfake detection with multi-scale convolution and vision transformer, Digital Signal Processing, vol. 134, p. 103895, 2023.

[5] H. Zhao, T. Wei, W. Zhou, W. Zhang, D. Chen, and N. Yu, Multi-attentional deepfake detection, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 2185–2194.

AUC by dataset. DF, F2F, FS, NT, and FF++ are intra-dataset results; Celeb-DF, DFDC, DiffFace, and DiffSwap are cross-dataset results.
Method	FFFE	LRAA	DF	F2F	FS	NT	FF++	Celeb-DF	DFDC	DiffFace	DiffSwap
(a)	−	−	0.995	0.985	0.992	0.978	0.990	0.834	0.771	0.760	0.717
(b)	−	√	0.997	0.991	0.997	0.982	0.993	0.854	0.783	0.835	0.803
(c)	√	−	0.999	0.997	0.998	0.982	0.994	0.863	0.763	0.806	0.775
(d)	√	√	0.999	0.999	0.999	0.990	0.997	0.904	0.808	0.903	0.857
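
To make the cross-dataset effect of the two components easier to read, the following Python sketch recomputes the AUC gain of variant (d), with both FFFE and LRAA enabled, over variant (a), with neither. The values are copied from the table above; the variable names are illustrative assumptions and do not come from the paper.

```python
# Cross-dataset AUC values copied from rows (a) and (d) of the table above.
# The dictionary names are illustrative only.
baseline = {"Celeb-DF": 0.834, "DFDC": 0.771, "DiffFace": 0.760, "DiffSwap": 0.717}
full = {"Celeb-DF": 0.904, "DFDC": 0.808, "DiffFace": 0.903, "DiffSwap": 0.857}

# Absolute AUC gain of (d) over (a) on each cross-dataset benchmark.
gains = {name: round(full[name] - baseline[name], 3) for name in baseline}

# Unweighted mean of the four gains.
mean_gain = sum(gains.values()) / len(gains)

for name, gain in gains.items():
    print(f"{name}: +{gain:.3f}")
print(f"mean gain: +{mean_gain:.4f}")
```

On these numbers the gains are largest on DiffFace and DiffSwap (0.143 and 0.140) and smallest on DFDC (0.037).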

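Category	Method	AUC
Intra-frame	Xception [40]	0.997
Intra-frame	F3-Net [8]	0.986
Intra-frame	LTW [41]	0.991
Intra-frame	DCL [42]	0.993
Intra-frame	RECCE [43]	0.991
Intra-frame	MSFCL [44]	0.993
Intra-frame	DIFL [45]	0.993
Inter-frame	ST-ILIF [18]	0.986
Inter-frame	FCAN-DCT [46]	0.990
Inter-frame	Ours	0.997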

Columns DF, FS, F2F, and NT are manipulation types; − marks a value that is not reported.
Category	Method	DF	FS	F2F	NT	Average
Intra-frame	Xception [40]	0.939	0.512	0.868	0.797	0.779
Intra-frame	LTW [41]	0.927	0.640	0.802	0.773	0.785
Intra-frame	DCL [42]	0.949	−	0.829	−	−
Intra-frame	RECCE [43]	0.920	0.625	0.813	0.783	0.785
Intra-frame	MSFCL [44]	0.941	0.656	0.814	0.792	0.801
Intra-frame	DIFL [45]	0.947	0.779	0.856	0.802	0.846
Inter-frame	Ours	0.962	0.805	0.905	0.817	0.872
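
The Average column in the table above appears to be the unweighted mean of the four per-manipulation scores (presumably AUC, as in the other tables). The Python sketch below checks that reading against the reported values. It is a consistency check under that assumption, not a description of how the authors computed the column, and the names `rows` and `mean_score` are illustrative.

```python
# Per-manipulation scores (DF, FS, F2F, NT) and the reported Average,
# copied from the table above. DCL is left out because two of its
# entries are not reported.
rows = {
    "Xception": ((0.939, 0.512, 0.868, 0.797), 0.779),
    "LTW": ((0.927, 0.640, 0.802, 0.773), 0.785),
    "RECCE": ((0.920, 0.625, 0.813, 0.783), 0.785),
    "MSFCL": ((0.941, 0.656, 0.814, 0.792), 0.801),
    "DIFL": ((0.947, 0.779, 0.856, 0.802), 0.846),
    "Ours": ((0.962, 0.805, 0.905, 0.817), 0.872),
}


def mean_score(values):
    """Unweighted mean of the per-manipulation scores."""
    return sum(values) / len(values)


# Largest absolute difference between the recomputed and reported averages.
max_gap = max(abs(mean_score(values) - reported) for values, reported in rows.values())
print(f"largest gap between recomputed and reported average: {max_gap:.4f}")
```

Every recomputed mean lies within 0.001 of the reported Average, which is consistent with a simple mean followed by rounding to three decimals.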

Algorithm 1 Procedure of training LFGDIN
Input:
  $T$ global face frames $c_{u}$, $u \in \{1, 2, \dots, T\}$;
  $T$ local region boxes $o_{u}$; // Obtained by using RetinaFace to locate local regions within the global face frames; not yet downsampled
  $T$ local region frames $e_{u}$; // Obtained by cropping and splicing based on the local region boxes
  Label $a$ of the video
Output: Classification result $y$
1: for $iter = 0$ to $N_{iter} - 1$ do // $N_{iter}$ is the number of iterations
2:   $f^{grgb} = \mathrm{GSTN\_Stage1}(c)$; // Extracting global spatiotemporal features from intermediate layers through GSTN
3:   for $u = 1$ to $T$ do // Extracting frequency fusion features for each local region frame $e_{u}$
4:     Extracting multi-band feature $f_{u}^{mb}$ from $e_{u}$ by applying Eqs. (8) and (9);
5:     Extracting block-wise frequency feature $f_{u}^{bw}$ from $e_{u}$;
6:     Acquiring SRM attention map $m_{u}^{SRM}$ from $e_{u}$;
7:     Extracting local region frequency fusion feature $f_{fusion, u}^{lfreq}$ by applying Eq. (10);
8:   end for
9:   $f_{dyn}^{lfreq} = \mathrm{X3D}(f_{fusion}^{lfreq})$; // Extracting local region frequency dynamic features through X3D
10:   for $u = 1$ to $T$ do // Acquiring ROI attention map for $f_{dyn, u}^{lfreq}$
11:     Acquiring spatial attention $m_{u}^{freq}$ by applying Eq. (11);
12:     Acquiring ROI attention map $m_{u}^{roi}$ by applying Eq. (12) based on $o_{u}$;
13:   end for
14:   Extracting guided final global spatiotemporal features $ff^{grgb}$ by applying Eq. (13);
15:   $y = \mathrm{GSTN\_Stage2}(ff^{grgb})$; // Acquiring classification result through GSTN
16:   $loss = L_{bce}(y, a)$; // Computing $loss$ by using the binary cross-entropy loss function
17:   back_propagation($loss$); // Computing gradients
18:   update($LFGDIN$); // Updating the parameters of $LFGDIN$ using AdamW
19: end for
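
Step 16 of Algorithm 1 uses the binary cross-entropy loss $L_{bce}(y, a)$. As a minimal sketch of that single step in plain Python, the code below assumes that $y$ is a probability in $(0, 1)$ and that the label $a$ is 0 or 1; which class the value 1 denotes is not stated in the listing. The clipping constant `eps` and the function names are illustrative additions, not part of the algorithm.

```python
import math


def binary_cross_entropy(y, a, eps=1e-7):
    """Binary cross-entropy for one video.

    y: predicted probability in (0, 1) (assumed to be a sigmoid output).
    a: ground-truth label, 0 or 1.
    eps: clipping constant that keeps log() finite (an illustrative choice).
    """
    y = min(max(y, eps), 1.0 - eps)
    return -(a * math.log(y) + (1 - a) * math.log(1.0 - y))


def batch_loss(preds, labels):
    """Mean binary cross-entropy over a batch of videos."""
    losses = [binary_cross_entropy(y, a) for y, a in zip(preds, labels)]
    return sum(losses) / len(losses)


# A confident correct prediction is penalized less than a confident wrong one.
print(binary_cross_entropy(0.9, 1) < binary_cross_entropy(0.1, 1))
```

In a deep learning framework this step, together with the gradient computation and the AdamW update of steps 17 and 18, would normally be handled by the framework's built-in loss and optimizer classes rather than written by hand.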