Digital Investigation

Volume 11, Issue 2, June 2014, Pages 120-140

A passive approach for effective detection and localization of region-level video forgery with spatio-temporal coherence analysis

https://doi.org/10.1016/j.diin.2014.03.016

Abstract

In this paper, we present a passive approach for effective detection and localization of region-level forgery in video sequences, possibly with camera motion. As most digital image/video capture devices do not have modules for embedding watermarks or signatures, passive forgery detection, which aims to detect traces of tampering without embedded information, has become the major focus of recent research. However, most current passive approaches either work only for frame-level detection and cannot localize region-level forgery, or suffer from high false detection rates when localizing tampered regions. In this paper, we investigate two common region-level inpainting methods for object removal, temporal copy-and-paste and exemplar-based texture synthesis, and propose a new approach based on spatio-temporal coherence analysis for detection and localization of tampered regions. Our approach can handle camera motion and multiple object removals. Experiments show that our approach outperforms previous approaches and can effectively detect and localize regions tampered by temporal copy-and-paste and texture synthesis.

Introduction

Visual imagery has been widely used to provide essential evidence in many diverse areas, ranging from mainstream media, journalism, and scientific publication, to medical imaging, criminal investigation, and surveillance systems, to name a few. While we have historically had confidence in the integrity and authenticity of visual imagery, such trust has been gradually lost. With the rapid growth of digital devices and multimedia editing technology (Perez et al., 2003; Kwatra et al., 2003; Criminisi et al., 2004; Shen et al., 2006; Hays and Efros, 2007; Komodakis and Tziritas, 2007; Patwardhan et al., 2007; Ling et al., 2011), it has become easier than ever to produce and modify digital videos with increasing sophistication. Doctored videos are very difficult, if not impossible, to identify through visual examination. Therefore, digital video forensics, which aims to verify the trustworthiness of digital video, has become an important and exciting field of recent research.

There are two types of digital image/video forensics: active and passive. In active approaches, a watermark, which provides information to verify the integrity and authenticity of digital images/videos, is inserted into an image (or a video frame) while it is acquired (Cox et al., 2002; Guo et al., 2006; Martino and Sessa, 2012). Unfortunately, many digital image capture devices do not contain a module for inserting watermarks. Therefore, passive approaches, which aim to detect traces of tampering without using prior information, have been extensively studied in recent research (Farid, 2009; Mahdian and Saic, 2010; Milani et al., 2012).

Over the past few years, a number of passive approaches have been proposed, which can be roughly classified into four categories: pixel-based, format-based, camera-based, and geometric-based. Pixel-based approaches examine pixel-level anomalies caused by tampering, such as correlations between frames arising from duplicated frames (Wang and Farid, 2007a; Lin et al., 2011) and tampered regions (Zhang et al., 2009; Lin and Tsay, 2013; Kirchner et al., 2013). Format-based approaches exploit the unique properties of video compression, such as periodic properties (Wang and Farid, 2006; Wang and Farid, 2009; Sun et al., 2012) and blocking artifacts (Luo et al., 2008) in MPEG-1 and MPEG-2 videos. Camera-based approaches analyze the specific sensor artifacts caused by components in the imaging pipeline, such as sensor noise (Mondaini et al., 2007; Hsu et al., 2007; Houten and Geradts, 2009; Kobayashi et al., 2010) and interlaced scanning (Wang and Farid, 2007b). Geometric-based approaches inspect the geometric properties of objects and their positions relative to the camera (Conotter et al., 2012).

Previous research has studied the problems of frame duplication detection (Wang and Farid, 2007a; Lin et al., 2011), MPEG-based forgery detection (Wang and Farid, 2006; Wang and Farid, 2009; Luo et al., 2008; Sun et al., 2012), and localization of tampered areas (Hsu et al., 2007; Zhang et al., 2009; Kirchner et al., 2013; Lin and Tsay, 2013). Temporal correlation information has proven very useful for detecting tampering in video sequences, as temporal correlations are often destroyed when successive frames are tampered with. However, most previous approaches are effective mainly for detecting malicious manipulation; localization of tampered areas is either not addressed or not effectively solved by them.

For pixel-based approaches, Wang and Farid (2007a) used the high correlation between original and forged regions to detect copy-and-paste forgery. However, as high correlation is common in natural videos, their method does not work well if copied regions are taken from other videos. Lin et al. (2011) proposed a coarse-to-fine approach which uses differences in the color histograms of adjacent frames to identify candidate clips in the temporal domain, and then uses a block-based correlation algorithm in the spatial domain to measure the similarity between the query clip and each candidate clip in order to detect duplicated clips. Since their approach performs only frame-level detection, it cannot localize region-level forgery. Zhang et al. (2009) proposed an approach which uses the ghost shadow artifacts introduced by inpainting to detect forged regions. However, their approach is vulnerable to noise effects such as illumination changes, and cannot accurately locate the tampered areas in each frame.

For format-based approaches, Wang and Farid (2006) used spatial and temporal artifacts of double MPEG compression. An MPEG I-frame sequence is similar to a sequence of JPEG-compressed images, except that there is more correlation among frames within a given Group of Pictures (GOP). Thus, detecting double compression of I-frames is similar to detecting JPEG double compression, and within a GOP, adding or deleting a frame increases the motion-estimation error. However, their work is only effective for frame-level tamper detection and cannot locate tampered regions. Luo et al. (2008) extended their image forensic method proposed in Luo et al. (2007), and used temporal patterns of blocking artifacts to detect whether an MPEG video has suffered frame removal or addition before recompression. Both approaches are frame-level, and do not address region-level tampering detection and localization.

For camera-based approaches, Hsu et al. (2007) used block-based correlation of noise residue to detect and locate forged regions in a video. This method is based on the observation that tampering alters the noise-residue correlation, rendering it different from that of non-tampered areas. However, the correlation of noise residue is unstable for videos captured by a moving camera and sensitive to quantization noise. Wang and Farid (2007b) examined the consistency of the de-interlacing parameters used to convert an interlaced video into a non-interlaced one. Since interlaced videos contain half the vertical resolution of the original videos, the de-interlacing procedure exploits interpolation, insertion, and duplication of frames to produce full-resolution videos. While their method can detect tampered regions in a video, it applies only to interlaced or de-interlaced videos. Kobayashi et al. (2010) used photon shot noise to detect forged regions, exploiting the inconsistencies in photon shot noise caused by different video cameras. However, their approach can only detect forgery in static-scene videos, and thus cannot be applied to videos captured by moving cameras.

In this paper, we study the problem of detecting and localizing tampered regions resulting from the removal of unwanted objects from videos, and present an effective and robust approach based on spatio-temporal coherence analysis. Our approach can handle videos with camera motion, and effectively localizes regions tampered by well-known inpainting methods such as temporal copy-and-paste and exemplar-based texture synthesis. To handle camera motion, we first calculate frame motion vectors between all pairs of successive frames, and use them to partition the video sequence into groups so that motion within the same group is small enough to be neglected in coherence analysis. We then align successive frames in the same group according to the relative motions among them. After frame alignment, spatio-temporal slices are analyzed to detect regions of unnaturally high or abnormally low coherence across successive frames. Finally, a detection and localization process identifies the tampered regions.
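The grouping step just described can be sketched in a few lines. The code below is an illustrative reading of the idea, not the authors' implementation: successive-frame motion vectors are accumulated, and a new group starts whenever the accumulated drift from the group's first frame exceeds a threshold (the function name `group_frames` and the threshold `max_shift` are our own assumptions).

```python
def group_frames(motions, max_shift=2.0):
    """Partition frame indices 0..len(motions) into groups.

    motions[t] is the (dx, dy) camera motion between frame t and t+1.
    Within a group, the total drift from the group's first frame stays
    below max_shift pixels, so residual motion can be neglected during
    coherence analysis.
    """
    groups, current = [], [0]
    acc_x = acc_y = 0.0
    for t, (dx, dy) in enumerate(motions):
        acc_x += dx
        acc_y += dy
        if (acc_x ** 2 + acc_y ** 2) ** 0.5 > max_shift:
            groups.append(current)               # close the current group
            current, acc_x, acc_y = [], 0.0, 0.0
        current.append(t + 1)
    groups.append(current)
    return groups

# A static camera (zero motion) yields a single group:
print(group_frames([(0.0, 0.0)] * 4))            # [[0, 1, 2, 3, 4]]
# A steady 1 px/frame pan splits the sequence once the drift exceeds 2 px:
print(group_frames([(1.0, 0.0)] * 4))            # [[0, 1, 2], [3, 4]]
```

The key design point this sketch captures is that grouping bounds the residual motion per group, so later analysis can treat each group's frames as (approximately) spatially registered.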

We have carried out experiments over major inpainting methods, such as temporal copy-and-paste and exemplar-based texture synthesis, which are used to fill the holes left by the removal of unwanted objects from a video. The results show that the proposed approach outperforms the ADI approach proposed in Zhang et al. (2009), and can effectively detect and localize tampered areas in videos from either static or moving cameras. It should be noted that a preliminary version of the proposed approach was sketched in Lin and Tsay (2013). This paper is the complete version, which elaborates all details, carries out more extensive experiments, and makes significant improvements.

The remainder of this paper is organized as follows. Section Overview briefly overviews the problem, and sketches our main approach. Section Frame grouping and alignment gives the details of frame grouping and alignment. Section Spatio-temporal coherence analysis gives the details of spatio-temporal coherence analysis. Section Tampered slice detection and region localization gives the details of tampered slice detection and region localization. Section Experimental results gives experimental results, and improvement for compressed videos. Section Conclusion concludes.

Section snippets

Overview

In this section, we briefly overview the spatio-temporal slice technique, and sketch our proposed approach for detection and localization of tampered regions caused by object removal in video sequences.

The spatio-temporal slice technique is widely used in analyzing spatial and temporal relationships in video sequences (Adelson and Bergen, 1985), and is commonly used in various research areas, such as motion analysis and segmentation (Ngo et al., 2003), human gait analysis (Niyogi and Adelson,

Frame grouping and alignment

A video of length L is a sequence of L frames f1, …, fL. Each frame ft, t ∈ [1, L], is an image of M × N pixels. To deal with camera motion, we first compute camera motions between all pairs of successive frames. We then partition the video sequence into frame groups so that the camera motion within each group is smaller than a predefined threshold. Frames in the same group are then aligned to form a 3D volume for spatio-temporal coherence analysis.
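The alignment step can be illustrated as follows. This is a simplified stand-in, assuming integer per-frame shifts: each frame is resampled onto the reference frame's grid, with pixels that leave the field of view marked as missing (`None`). The helper name `align_frame` and the fill convention are ours, not the paper's.

```python
def align_frame(frame, dx, dy, fill=None):
    """Shift a frame (a list of pixel rows) by (dx, dy) pixels so that it
    overlays the group's reference frame; uncovered pixels get `fill`."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x + dx, y + dy          # source pixel in the unaligned frame
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = frame[sy][sx]
    return out

f = [[1, 2],
     [3, 4]]
print(align_frame(f, 1, 0))                  # [[2, None], [4, None]]
print(align_frame(f, 0, 0))                  # identity: [[1, 2], [3, 4]]
```

Stacking the aligned frames of one group then yields the 3D volume over which the spatio-temporal coherence analysis operates; a real implementation would use subpixel interpolation rather than integer shifts.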

The camera motion between successive frames ft-1

Spatio-temporal coherence analysis

Spatio-temporal coherence analysis is performed over each frame group independently. The objective is to produce a group coherence abnormality pattern (GCAP) for each frame group, which can be used in subsequent steps to identify regions with unnaturally high or abnormally low coherence. For simplicity, we assume the video sequence consists of only one frame group in the following discussion. Otherwise, we simply repeat the following analysis for each group.
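To make the idea concrete, the sketch below shows how a horizontal spatio-temporal slice could be taken from the aligned volume and turned into a per-pixel abnormality map: near-zero frame-to-frame change suggests unnaturally high coherence (as left by temporal copy-and-paste), while very large change suggests abnormally low coherence. The thresholds, function names, and the +1/−1/0 encoding are illustrative assumptions, not the paper's exact GCAP construction.

```python
def horizontal_slice(volume, y):
    """volume[t][y][x] -> slice[t][x]: row y of every frame, stacked over time."""
    return [frame[y] for frame in volume]

def abnormality(slice_tx, low=0.5, high=40.0):
    """Mark each (t, x) as +1 (suspiciously coherent), -1 (incoherent), or 0."""
    marks = []
    for t in range(1, len(slice_tx)):
        row = []
        for x in range(len(slice_tx[t])):
            d = abs(slice_tx[t][x] - slice_tx[t - 1][x])
            row.append(1 if d < low else (-1 if d > high else 0))
        marks.append(row)
    return marks

# Toy volume: 3 frames, each one row of 3 pixels. Column 0 drifts naturally,
# column 1 is frozen (copied), column 2 jumps once (a splice boundary).
v = [[[10, 10, 100]],
     [[12, 10, 10]],
     [[14, 10, 10]]]
s = horizontal_slice(v, 0)
print(abnormality(s))                        # [[0, 1, -1], [0, 1, 1]]
```

Aggregating such maps over all slices of a group would give a group-level abnormality pattern in the spirit of the GCAP described above.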

For each spatio-temporal slice, we

Tampered slice detection and region localization

In this section, we describe how to determine if a spatio-temporal slice contains tampered regions, and how to localize tampered regions in tampered slices.

Consider the spatio-temporal slices of the 3D volume formed by all frames in some frame group. Each spatio-temporal slice is compared to its group coherence abnormality pattern (GCAP) to determine whether it is tampered. In the following discussion, we explain how to identify tampered regions using the GCAP.
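A hedged sketch of this slice-level decision: each slice's abnormality map (SHM_i in the paper's notation) is compared against the GCAP, and a slice whose map departs from the group consensus too often is flagged. The disagreement-ratio test below is our own illustrative stand-in for the paper's sim/diff criterion, whose exact definition is elided in this excerpt.

```python
def flag_tampered(shm, gcap, max_diff_ratio=0.2):
    """Flag a slice whose abnormality map disagrees with the GCAP on more
    than max_diff_ratio of its entries."""
    total = diff = 0
    for row_s, row_g in zip(shm, gcap):
        for a, b in zip(row_s, row_g):
            total += 1
            diff += (a != b)
    return diff / total > max_diff_ratio

gcap = [[0, 0, 0], [0, 0, 0]]                        # group consensus: all normal
print(flag_tampered([[0, 1, 1], [0, 1, 1]], gcap))   # True  (4/6 entries disagree)
print(flag_tampered([[0, 0, 0], [0, 0, 1]], gcap))   # False (1/6 entries disagree)
```

Localization would then follow from the positions of the disagreeing entries within a flagged slice, since each entry maps back to a (frame, pixel) coordinate in the volume.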

Let sim(SHMi, GCAP) and diff(SHMi, GCAP) be

Experimental results

In this section, we measure the performance of our approach over 18 test videos, and present a comparison with the ADI approach proposed in Zhang et al. (2009). We further give an improvement of our approach for videos that are saved in compressed formats after being tampered.

Discussion and remarks

In real circumstances, object removal from the scene is one of the major attacks in applications such as video surveillance. Furthermore, to create plausible tampered videos, TCP and ETS are the common techniques used for filling the holes left by the removal of unwanted objects. How to detect and localize tampered regions caused by object removal is an important task for digital forensic practitioners. Although a number of approaches have been proposed, most of them work for frame-level detection only and cannot localize region-level forgery.

Conclusion

In this paper, we have investigated two common region-level inpainting methods, temporal copy-and-paste and exemplar-based texture synthesis, for filling the holes left by object removal, and have proposed a new approach based on spatio-temporal coherence analysis for passive detection and localization of tampered regions. Our approach handles camera motion by frame grouping and alignment. Experiments show that our approach can effectively detect and localize regions manipulated by temporal copy-and-paste and exemplar-based texture synthesis.

References (39)

  • H. Guo et al. A fragile watermarking scheme for detecting malicious modifications of database relations. Inf Sci (2006)
  • B. Mahdian et al. A bibliography on blind methods for identifying image forgery. Signal Process Image Commun (2010)
  • E.H. Adelson et al. Spatiotemporal energy models for the perception of motion. J Opt Soc Am Opt Image Sci (1985)
  • A. Criminisi et al. Region filling and object removal by exemplar-based image inpainting. IEEE Trans Image Process (2004)
  • I.J. Cox et al. Digital watermarking and fundamentals (2002)
  • V. Conotter et al. Exposing digital forgeries in ballistic motion. IEEE Trans Inf Forensics Secur (2012)
  • H. Farid. A survey of image forgery detection. IEEE Signal Process Mag (2009)
  • S. Goferman et al. Context-aware saliency detection
  • R.C. Gonzalez et al. Digital image processing (2008)
  • J. Hays et al. Scene completion using millions of photographs. ACM Trans Graph (2007)
  • C.-C. Hsu et al. Video forgery detection using correlation of noise residue. Proc. of IEEE Int. Conf. on Multimedia Signal Processing (2007)
  • W.V. Houten et al. Source video camera identification for multiply compressed videos originating from YouTube. Digit Investig (2009)
  • N. Komodakis et al. Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans Image Process (2007)
  • V. Kwatra et al. Graphcut textures: image and video synthesis using graph cuts. ACM Trans Graph (2003)
  • M. Kirchner et al. Impeding forgers at photo inception
  • M. Kobayashi et al. Detecting forgery from static-scene video based on inconsistency in noise level functions. IEEE Trans Inf Forensics Secur (2010)
  • C.-H. Ling et al. Virtual contour guided video object inpainting using posture mapping and retrieval. IEEE Trans Multimedia (2011)
  • G.-S. Lin et al. Detecting frame duplication based on spatial and temporal analyses
  • C.-S. Lin et al. Passive approach for video forgery detection and localization