Research | Open Access

CLFormer: a cross-lingual transformer framework for temporal forgery localization

Haonan Cheng¹, Hanyue Liu², Juanjuan Cai³, Long Ye¹,⁴

State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China

School of Information and Communication Engineering, Communication University of China, Beijing 100024, China

Key Laboratory of Media Audio & Video (Communication University of China), Ministry of Education, Communication University of China, Beijing 100024, China

School of Data Science and Media Intelligence, Communication University of China, Beijing 100024, China

Abstract

Temporal forgery localization (TFL) is crucial in deepfake detection. It focuses on identifying subtle temporal manipulations within video content. However, current TFL methods generalize poorly, especially across languages, which limits their performance in diverse environments. This limitation stems from two key factors. First, most existing datasets are English-centric. Second, multi-modal information is learned inadequately: visual features are often prioritized over audio features. To address this gap, we created the Chinese audio-visual deepfake (CHAV-DF) dataset, the first dataset designed for TFL in the Chinese context. This dataset provides a valuable benchmark for evaluating TFL methods in cross-lingual settings. Additionally, we introduced a cross-lingual transformer framework (CLFormer), which prioritizes audio features and uses a pre-trained multi-lingual Wav2Vec2 to enhance cross-lingual generalization, while incorporating visual features to further refine TFL. Moreover, we incorporated a refinement module into CLFormer to enhance the accuracy of forgery localization. Experiments on the LAV-DF, CHAV-DF, and AV-Deepfake1M datasets demonstrate that CLFormer performs well in both same-language and cross-language settings. Specifically, CLFormer achieves an average precision (AP) of 57.68% at a temporal intersection over union (tIoU) of 0.50 when trained on CHAV-DF and tested on LAV-DF, surpassing the state-of-the-art method by 47.59% and validating its cross-language generalization capability.
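The tIoU threshold mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the names `temporal_iou` and `is_true_positive`, and the representation of a segment as a `(start, end)` pair in seconds, are assumptions made for illustration. It shows only the matching criterion (a predicted forged segment counts as a hit when its tIoU with some ground-truth segment reaches the threshold); a full AP computation would additionally rank predictions by confidence and match each ground-truth segment at most once.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Temporal IoU: length of the overlap divided by length of the union."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Segment, truths: List[Segment],
                     threshold: float = 0.50) -> bool:
    """A prediction is a hit if it reaches the tIoU threshold with any truth."""
    return any(temporal_iou(pred, t) >= threshold for t in truths)


if __name__ == "__main__":
    # Overlap is 1 s and union is 3 s, so this pair falls below tIoU 0.50.
    print(temporal_iou((1.0, 3.0), (2.0, 4.0)))
    print(is_true_positive((1.0, 3.0), [(1.5, 3.0)]))
```

Under this criterion, "AP at tIoU 0.50" means that only predictions overlapping a true forged segment by at least half of their combined extent are counted as correct.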

Keywords

Temporal forgery localization (TFL); Cross-lingual; Audio feature; Wav2Vec2; Boundary refinement

Visual Intelligence

Volume 3, 2025

Article number: 13

DOI: 10.1007/s44267-025-00084-z

Cite this article:

Cheng H, Liu H, Cai J, et al. CLFormer: a cross-lingual transformer framework for temporal forgery localization. Visual Intelligence, 2025, 3: 13. https://doi.org/10.1007/s44267-025-00084-z

Received: 15 November 2024

Revised: 21 June 2025

Accepted: 23 June 2025

Published: 06 November 2025

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.