Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

——— ICCV 2025 ———
   
1. CVI2, SnT, University of Luxembourg, Luxembourg    
2. Cristal Laboratory, National School of Computer Sciences, University of Manouba    
TL;DR: We introduce FakeSTormer, a self-supervised multi-task learning framework that incorporates learning objectives specifically targeting both spatial and fine-grained temporal vulnerabilities for generalizable deepfake video detection.

📜 Abstract

Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained on real and fake image sequences, which hinders their generalization to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hand-free annotations for our auxiliary branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach over recent state-of-the-art methods.

⚡ FakeSTormer

We propose FakeSTormer to address two key challenges: (i) improving the generalizability of video-based deepfake detectors while (ii) remaining robust to high-quality deepfake videos. To achieve this, we redefine deepfake video detection as a fine-grained detection task, proposing a multi-branch network that leverages synthesized data and incorporates learning objectives that specifically target subtle spatial and temporal artifacts, as sketched below.
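For concreteness, here is a minimal PyTorch sketch of such a multi-task objective: a standard classification loss combined with two auxiliary regression losses supervising the spatial and temporal branches. The loss weights, the MSE choice for the auxiliary terms, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

    import torch.nn.functional as F

    def multitask_loss(cls_logit, spatial_pred, temporal_pred,
                       label, spatial_gt, temporal_gt,
                       w_spatial=1.0, w_temporal=1.0):
        # cls_logit and label: (B, 1) float tensors; soft labels in [0, 1]
        # are allowed, as produced by the data synthesis described below.
        l_cls = F.binary_cross_entropy_with_logits(cls_logit, label)
        # The auxiliary branches regress the synthesized vulnerability targets.
        l_spatial = F.mse_loss(spatial_pred, spatial_gt)
        l_temporal = F.mse_loss(temporal_pred, temporal_gt)
        return l_cls + w_spatial * l_spatial + w_temporal * l_temporal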

Method

Framework Overview

I) Overview of the proposed framework: Our multi-task learning framework, FakeSTormer, consists of three branches, i.e., the temporal branch \( h \), the spatial branch \( g \), and the standard classification branch \( f \). These branches are specifically designed to facilitate disentangled learning of spatio-temporal features. The hand-free ground truth used to train the framework is generated by our proposed video-level data synthesis algorithm, coupled with a vulnerability-driven Cutout strategy.
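To make the branch layout concrete, below is a hedged PyTorch sketch of how \( f \), \( g \), and \( h \) could sit on top of a shared spatio-temporal backbone. The token shapes, pooling choices, and single-layer heads are assumptions for illustration, not the actual architecture.

    import torch
    import torch.nn as nn

    class ThreeBranchSketch(nn.Module):
        def __init__(self, dim=768):
            super().__init__()
            self.backbone = nn.Identity()  # placeholder spatio-temporal encoder
            self.f = nn.Linear(dim, 1)     # classification branch f
            self.g = nn.Linear(dim, 1)     # spatial branch g (per-patch scores)
            self.h = nn.Linear(dim, 1)     # temporal branch h (per-frame scores)

        def forward(self, tokens):
            # tokens: (B, T, N, D) grid of T frames with N patch tokens each.
            z = self.backbone(tokens)
            cls_logit = self.f(z.mean(dim=(1, 2)))            # (B, 1) video-level score
            spatial_map = self.g(z).squeeze(-1)               # (B, T, N) spatial vulnerabilities
            temporal_vec = self.h(z.mean(dim=2)).squeeze(-1)  # (B, T) temporal vulnerabilities
            return cls_logit, spatial_map, temporal_vec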

II) Overview of generating a self-blended video: It comprises two main components: a landmark interpolation (LI) module and the consistent use of synthesized parameters (CSP) across frames.
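A hedged Python sketch of these two components follows. Here, detect_landmarks, sample_blend_params, and self_blend are hypothetical stand-ins for a facial landmark detector and an image-level self-blending routine, and the exponential smoothing in the LI part is an illustrative substitute for the paper's landmark interpolation.

    import numpy as np

    def landmark_interpolation(landmarks, alpha=0.5):
        # LI (illustrative): temporally smooth per-frame landmarks so the
        # blended region moves coherently instead of jittering across frames.
        smoothed = [landmarks[0]]
        for lm in landmarks[1:]:
            smoothed.append(alpha * smoothed[-1] + (1.0 - alpha) * lm)
        return smoothed

    def make_self_blended_video(frames, detect_landmarks, sample_blend_params, self_blend):
        landmarks = [detect_landmarks(f) for f in frames]
        landmarks = landmark_interpolation(landmarks)
        # CSP: draw the synthesis parameters once and reuse them for every
        # frame, keeping artifacts subtle and temporally consistent.
        params = sample_blend_params()
        return [self_blend(f, lm, params) for f, lm in zip(frames, landmarks)]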

III) Examples of pseudo-fake videos: with (w/) and without (w/o) vulnerability-driven Cutout, together with their corresponding soft labels. The Cutout augmentation is applied at the same spatial locations throughout the video frames.
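A minimal NumPy sketch of this augmentation, assuming a precomputed spatial vulnerability map is used to pick the Cutout location, which is then reused in every frame; the argmax selection rule and the patch size are illustrative assumptions.

    import numpy as np

    def vulnerability_cutout(frames, vuln_map, size=32):
        # Pick one location from the vulnerability map (here: its argmax).
        y, x = np.unravel_index(np.argmax(vuln_map), vuln_map.shape)
        y0, x0 = max(0, y - size // 2), max(0, x - size // 2)
        out = []
        for frame in frames:
            frame = frame.copy()
            # The same spatial region is erased in every frame of the clip.
            frame[y0:y0 + size, x0:x0 + size] = 0
            out.append(frame)
        return out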

IV) Extraction of temporal vulnerabilities: We compute derivatives of the spatial vulnerabilities over time.
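Since the temporal vulnerabilities are defined as time derivatives of the spatial ones, a minimal sketch is a finite difference over the frame axis; taking the absolute value is an assumption here.

    import numpy as np

    def temporal_vulnerability(spatial_vuln):
        # spatial_vuln: (T, H, W) stack of per-frame spatial vulnerability maps.
        # A forward finite difference approximates the derivative over time.
        return np.abs(np.diff(spatial_vuln, axis=0))  # (T - 1, H, W)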

🖼️ Experimental Results

To assess the generalization capabilities of our method, we conduct evaluations under four challenging setups: i) Generalization to unseen datasets, i.e., datasets other than the training dataset (FF++); ii) Generalization on heavily compressed data; iii) Generalization to unseen manipulations; and iv) Robustness to unseen perturbations.


Generalization to Unseen Datasets



Generalization on Heavily Compressed Data and Unseen Manipulations


Robustness to Unseen Perturbations


Visualization of Saliency Maps

❗ Please refer to the main paper for detailed ablation experiments! ❗

📥 BibTeX


    @article{nguyen2025vulnerability,
        title={Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection},
        author={Nguyen, Dat and Astrid, Marcella and Kacem, Anis and Ghorbel, Enjie and Aouada, Djamila},
        journal={arXiv preprint arXiv:2501.01184},
        year={2025}
    }
  

💌 Acknowledgement

This work is supported by the Luxembourg National Research Fund, under the BRIDGES2021/IS/16353350/FaKeDeTeR project, and by POST Luxembourg. Experiments were performed on the Luxembourg national supercomputer MeluXina. The authors gratefully acknowledge the LuxProvide teams for their expert support.



This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.