DSViT: An Enhanced Transformer Model for Deepfake Detection

Authors

  • Pham Minh Thuan
  • Bui Thu Lam
  • Pham Duy Trung

DOI:

https://doi.org/10.54654/isj.v2i22.1055

Keywords:

deepfake detection, DSViT, deepfake, spatial deepfake

Abstract

The rapid development of artificial intelligence and deep learning models has enabled the creation of highly realistic fake images and videos, posing significant threats to information security and safety. Accurate detection of such forged content is crucial to prevent the spread of misinformation and to protect the integrity of digital media. Although advanced models such as the Vision Transformer (ViT) and the Convolutional Vision Transformer (CViT) have been applied to this problem, limitations remain. In this paper, we introduce a novel model, improved from CViT and designed to optimize deepfake detection, named DSViT (Deepfake Detection with SC-based Convolutional Vision Transformer). The model integrates convolutional layers and an SCConvolution block with the ViT architecture. We conducted experiments on the Deepfake Detection Challenge (DFDC) dataset and compared the results with those of the CViT model to demonstrate the effectiveness of the proposed model.

References

F. Abbas and A. Taeihagh, “Unmasking deepfakes: A systematic review of deepfake detection and generation techniques using artificial intelligence,” Expert Systems With Applications, 2024: 124260.

A. Naitali, M. Ridouani, F. Salahdine, and N. Kaabouch, “Deepfake attacks: Generation, detection, datasets, challenges, and research directions,” Computers, vol. 12, no. 10, p. 216, Oct 2023.

X. Li, H. Zhou, and M. Zhao, “Transformer-based cascade networks with spatial and channel reconstruction convolution for deepfake detection,” Mathematical Biosciences and Engineering, vol. 21, no. 3, pp. 4142-4164, 2024.

D. Wodajo and S. Atnafu, “Deepfake Video Detection Using Convolutional Vision Transformer,” arXiv preprint arXiv:2102.11126, 2021.

F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.

H. H. Nguyen, N. T. Tieu, and I. Echizen, “Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307-2311, May 2019.

D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “MesoNet: A Compact Facial Video Forgery Detection Network,” arXiv:1809.00888, Sep 2018.

M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” in International Conference on Machine Learning (ICML), Jun 2019.

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in International Conference on Learning Representations (ICLR), May 2021.

A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “ViViT: A Video Vision Transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6816-6826, 2021.

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” in International Conference on Computer Vision (ICCV), Oct 2021.

W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Feb 2022.

J. Li, Y. Wen, and L. He, “SCConv: Spatial and channel reconstruction convolution for feature redundancy,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

Kaggle (2020), Deepfake Detection Challenge. Accessed September 10, 2024, from: https://www.kaggle.com/c/deepfake-detection-challenge/data.

Published

2024-10-01

How to Cite

Thuan, P. M., Lam, B. T., & Trung, P. D. (2024). DSViT: An Enhanced Transformer Model for Deepfake Detection. Journal of Science and Technology on Information Security, 2(22), 17-28. https://doi.org/10.54654/isj.v2i22.1055

Section

Papers