Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection

——— ICCV 2025 ———
   
1. CVI2, SnT, University of Luxembourg, Luxembourg    
2. Cristal Laboratory, National School of Computer Sciences, University of Manouba    
TL;DR: We introduce FakeSTormer, a self-supervised multi-task learning framework that incorporates learning objectives specifically targeting both spatial and fine-grained temporal vulnerabilities for generalizable deepfake video detection.

📜 Abstract

Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained on real and fake image sequences, which hinders their generalization to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending to artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hand-free annotations for our auxiliary branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach over recent state-of-the-art methods.

⚡ FakeSTormer

We propose FakeSTormer to address two key challenges: (i) improving the generalizability of video-based deepfake detectors while (ii) remaining robust to high-quality deepfake videos. To achieve this, we redefine deepfake video detection as a fine-grained detection task, proposing a multi-branch network that leverages synthesized data and incorporates learning objectives that specifically target subtle spatial and temporal artifacts, as sketched below.
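For concreteness, here is a minimal PyTorch sketch of such a multi-task objective: a standard classification loss combined with two auxiliary regression losses supervising the spatial and temporal branches. The loss weights, the MSE choice for the auxiliary terms, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

    import torch.nn.functional as F

    def multitask_loss(cls_logit, spatial_pred, temporal_pred,
                       label, spatial_gt, temporal_gt,
                       w_spatial=1.0, w_temporal=1.0):
        # cls_logit and label: (B, 1) float tensors; soft labels in [0, 1]
        # are allowed, as produced by the data synthesis described below.
        l_cls = F.binary_cross_entropy_with_logits(cls_logit, label)
        # The auxiliary branches regress the synthesized vulnerability targets.
        l_spatial = F.mse_loss(spatial_pred, spatial_gt)
        l_temporal = F.mse_loss(temporal_pred, temporal_gt)
        return l_cls + w_spatial * l_spatial + w_temporal * l_temporal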

Method

Framework Overview

I) Overview of the proposed framework: Our multi-task learning framework, FakeSTormer, consists of three branches, i.e., the temporal branch \( h \), the spatial branch \( g \), and the standard classification branch \( f \). These branches are specifically designed to facilitate disentangled learning of spatio-temporal features. The hand-free ground truth used to train the framework is generated by our proposed video-level data synthesis algorithm, coupled with a vulnerability-driven Cutout strategy.
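To make the branch layout concrete, below is a hedged PyTorch sketch of how \( f \), \( g \), and \( h \) could sit on top of a shared spatio-temporal backbone. The token shapes, pooling choices, and single-layer heads are assumptions for illustration, not the actual architecture.

    import torch
    import torch.nn as nn

    class ThreeBranchSketch(nn.Module):
        def __init__(self, dim=768):
            super().__init__()
            self.backbone = nn.Identity()  # placeholder spatio-temporal encoder
            self.f = nn.Linear(dim, 1)     # classification branch f
            self.g = nn.Linear(dim, 1)     # spatial branch g (per-patch scores)
            self.h = nn.Linear(dim, 1)     # temporal branch h (per-frame scores)

        def forward(self, tokens):
            # tokens: (B, T, N, D) grid of T frames with N patch tokens each.
            z = self.backbone(tokens)
            cls_logit = self.f(z.mean(dim=(1, 2)))            # (B, 1) video-level score
            spatial_map = self.g(z).squeeze(-1)               # (B, T, N) spatial vulnerabilities
            temporal_vec = self.h(z.mean(dim=2)).squeeze(-1)  # (B, T) temporal vulnerabilities
            return cls_logit, spatial_map, temporal_vec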

II) Overview of generating a self-blended video: It comprises two main components: a landmark interpolation (LI) module and the consistent use of synthesized parameters (CSP) across frames.
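A hedged Python sketch of these two components follows. Here, detect_landmarks, sample_blend_params, and self_blend are hypothetical stand-ins for a facial landmark detector and an image-level self-blending routine, and the exponential smoothing in the LI part is an illustrative substitute for the paper's landmark interpolation.

    import numpy as np

    def landmark_interpolation(landmarks, alpha=0.5):
        # LI (illustrative): temporally smooth per-frame landmarks so the
        # blended region moves coherently instead of jittering across frames.
        smoothed = [landmarks[0]]
        for lm in landmarks[1:]:
            smoothed.append(alpha * smoothed[-1] + (1.0 - alpha) * lm)
        return smoothed

    def make_self_blended_video(frames, detect_landmarks, sample_blend_params, self_blend):
        landmarks = [detect_landmarks(f) for f in frames]
        landmarks = landmark_interpolation(landmarks)
        # CSP: draw the synthesis parameters once and reuse them for every
        # frame, keeping artifacts subtle and temporally consistent.
        params = sample_blend_params()
        return [self_blend(f, lm, params) for f, lm in zip(frames, landmarks)]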

III) Examples of pseudo-fake videos: with (w/) and without (w/o) vulnerability-driven Cutout, together with their corresponding soft labels. The Cutout augmentation is applied at the same spatial locations throughout the video frames.
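A minimal NumPy sketch of this augmentation, assuming a precomputed spatial vulnerability map is used to pick the Cutout location, which is then reused in every frame; the argmax selection rule and the patch size are illustrative assumptions.

    import numpy as np

    def vulnerability_cutout(frames, vuln_map, size=32):
        # Pick one location from the vulnerability map (here: its argmax).
        y, x = np.unravel_index(np.argmax(vuln_map), vuln_map.shape)
        y0, x0 = max(0, y - size // 2), max(0, x - size // 2)
        out = []
        for frame in frames:
            frame = frame.copy()
            # The same spatial region is erased in every frame of the clip.
            frame[y0:y0 + size, x0:x0 + size] = 0
            out.append(frame)
        return out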

IV) Extraction of temporal vulnerabilities: We compute derivatives of the spatial vulnerabilities over time.
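Since the temporal vulnerabilities are defined as time derivatives of the spatial ones, a minimal sketch is a finite difference over the frame axis; taking the absolute value is an assumption here.

    import numpy as np

    def temporal_vulnerability(spatial_vuln):
        # spatial_vuln: (T, H, W) stack of per-frame spatial vulnerability maps.
        # A forward finite difference approximates the derivative over time.
        return np.abs(np.diff(spatial_vuln, axis=0))  # (T - 1, H, W)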

🖼️ Experimental Results

To assess the generalization capabilities of our method, we conduct evaluations under four challenging setups: i) Generalization to unseen datasets, i.e., datasets other than the training dataset (FF++); ii) Generalization on heavily compressed data; iii) Generalization to unseen manipulations; and iv) Robustness to unseen perturbations.


Generalization to Unseen Datasets



Generalization on Heavily Compressed Data and Unseen Manipulations


Robustness to Unseen Perturbations


Visualization of Saliency Maps

❗ Please refer to the main paper for detailed ablation experiments! ❗

📥 BibTeX


    @article{nguyen2025vulnerability,
        title={Vulnerability-Aware Spatio-Temporal Learning for Generalizable and Interpretable Deepfake Video Detection},
        author={Nguyen, Dat and Astrid, Marcella and Kacem, Anis and Ghorbel, Enjie and Aouada, Djamila},
        journal={arXiv preprint arXiv:2501.01184},
        year={2025}
    }
  

💌 Acknowledgement

This work is supported by the Luxembourg National Research Fund, under the BRIDGES2021/IS/16353350/FaKeDeTeR project, and by POST Luxembourg. Experiments were performed on the Luxembourg national supercomputer MeluXina. The authors gratefully acknowledge the LuxProvide teams for their expert support.



This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.