Digital Investigation

Volume 11, Issue 2, June 2014, Pages 120-140

A passive approach for effective detection and localization of region-level video forgery with spatio-temporal coherence analysis

https://doi.org/10.1016/j.diin.2014.03.016

Abstract

In this paper, we present a passive approach for effective detection and localization of region-level forgery in video sequences, possibly with camera motion. As most digital image/video capture devices do not have modules for embedding watermarks or signatures, passive forgery detection, which aims to detect traces of tampering without embedded information, has become the major focus of recent research. However, most current passive approaches either work only for frame-level detection and cannot localize region-level forgery, or suffer from high false detection rates when localizing tampered regions. In this paper, we investigate two common region-level inpainting methods for object removal, temporal copy-and-paste and exemplar-based texture synthesis, and propose a new approach based on spatio-temporal coherence analysis for detection and localization of tampered regions. Our approach can handle camera motion and multiple object removals. Experiments show that our approach outperforms previous approaches and can effectively detect and localize regions tampered by temporal copy-and-paste and texture synthesis.

Introduction

Visual imagery has been widely used to provide essential evidence in many diverse areas, ranging from mainstream media, journalism, and scientific publication, to medical imaging, criminal investigation, and surveillance systems, to name a few. While we have historically had confidence in the integrity and authenticity of visual imagery, such trust has been gradually lost. With the rapid growth of digital devices and multimedia editing technology (Perez et al., 2003; Kwatra et al., 2003; Criminisi et al., 2004; Shen et al., 2006; Hays and Efros, 2007; Komodakis and Tziritas, 2007; Patwardhan et al., 2007; Ling et al., 2011), it has become easier than ever to produce and modify digital videos with increasing sophistication. Doctored videos are very difficult, if not impossible, to identify through visual examination. Therefore, digital video forensics, which aims to verify the trustworthiness of digital video, has become an important and exciting field of recent research.

There are two types of digital image/video forensics: active and passive. In active approaches, a watermark, which provides information to verify the integrity and authenticity of digital images/videos, is inserted into an image (or a video frame) while it is acquired (Cox et al., 2002; Guo et al., 2006; Martino and Sessa, 2012). Unfortunately, many digital image capture devices do not contain a module for inserting watermarks. Therefore, passive approaches, which aim to detect traces of tampering without using prior information, have been extensively studied in recent research (Farid, 2009; Mahdian and Saic, 2010; Milani et al., 2012).

Over the past few years, a number of passive approaches have been proposed, which can be roughly classified into four categories: pixel-based, format-based, camera-based, and geometric-based. Pixel-based approaches examine pixel-level anomalies caused by tampering, such as correlations between frames arising from duplicated frames (Wang and Farid, 2007a; Lin et al., 2011) and tampered regions (Zhang et al., 2009; Lin and Tsay, 2013; Kirchner et al., 2013). Format-based approaches exploit the unique properties of video compression, such as periodic properties (Wang and Farid, 2006; Wang and Farid, 2009; Sun et al., 2012) and blocking artifacts (Luo et al., 2008) in MPEG-1 and MPEG-2 videos. Camera-based approaches analyze the specific sensor artifacts caused by components in the imaging pipeline, such as sensor noise (Mondaini et al., 2007; Hsu et al., 2007; Houten and Geradts, 2009; Kobayashi et al., 2010) and interlaced scanning (Wang and Farid, 2007b). Geometric-based approaches inspect the geometric properties of objects and their positions relative to the camera (Conotter et al., 2012).

Previous research has studied the problems of frame duplication detection (Wang and Farid, 2007a; Lin et al., 2011), MPEG-based forgery detection (Wang and Farid, 2006; Wang and Farid, 2009; Luo et al., 2008; Sun et al., 2012), and localization of tampered areas (Hsu et al., 2007; Zhang et al., 2009; Kirchner et al., 2013; Lin and Tsay, 2013). Temporal correlation information has proven very useful for detecting tampering in video sequences, as temporal correlations are often destroyed when successive frames are tampered with. However, most previous approaches are effective mainly for detecting malicious manipulation; localization of tampered areas is either not addressed or not effectively solved by them.

For pixel-based approaches, Wang and Farid (2007a) used the high correlation between original and forged regions to detect copy-and-paste forgery. However, as high correlation is common in natural videos, their method does not work well if copied regions are taken from other videos. Lin et al. (2011) proposed a coarse-to-fine approach which uses differences in the color histograms of adjacent frames to identify candidate clips in the temporal domain, and then uses a block-based correlation algorithm in the spatial domain to measure the similarity between the query clip and each candidate clip in order to detect duplicated clips. Since their approach performs only frame-level detection, it cannot localize region-level forgery. Zhang et al. (2009) proposed an approach which uses the ghost shadow artifacts introduced by inpainting to detect forged regions. However, their approach is vulnerable to noise effects such as illumination changes, and cannot accurately locate the tampered areas in each frame.

For format-based approaches, Wang and Farid (2006) used spatial and temporal artifacts of double MPEG compression. An MPEG I-frame sequence is similar to a sequence of JPEG-compressed images, except that there is more correlation among frames within a given Group of Pictures (GOP). Thus, detecting double compression of I-frames is similar to detecting JPEG double compression, and within a GOP, adding or deleting a frame increases the motion-estimation error. However, their work is only effective for frame-level tamper detection and cannot locate tampered regions. Luo et al. (2008) extended their image forensic method proposed in Luo et al. (2007), and used temporal patterns of blocking artifacts to detect whether an MPEG video has suffered frame removal or addition before recompression. Both approaches are frame-level, and do not address region-level tampering detection and localization.

For camera-based approaches, Hsu et al. (2007) used block-based correlation of noise residue to detect and locate forged regions in a video. This method is based on the observation that tampering alters the noise-residue correlation, rendering it different from that of non-tampered areas. However, the correlation of noise residue is unstable for videos captured by a moving camera and sensitive to quantization noise. Wang and Farid (2007b) examined the consistency of the de-interlacing parameters used to convert an interlaced video into a non-interlaced one. Since interlaced videos contain half the vertical resolution of the original videos, the de-interlacing procedure exploits interpolation, insertion, and duplication of frames to produce full-resolution videos. While their method can detect tampered regions in a video, it applies only to interlaced or de-interlaced videos. Kobayashi et al. (2010) used photon shot noise to detect forged regions, exploiting the inconsistencies in photon shot noise caused by different video cameras. However, their approach can only detect forgery in static-scene videos, and thus cannot be applied to videos captured by moving cameras.

In this paper, we study the problem of detecting and localizing tampered regions resulting from the removal of unwanted objects from videos, and present an effective and robust approach based on spatio-temporal coherence analysis. Our approach can handle videos with camera motion, and effectively localizes regions tampered by well-known inpainting methods such as temporal copy-and-paste and exemplar-based texture synthesis. To handle camera motion, we first calculate frame motion vectors between all pairs of successive frames, and use them to partition the video sequence into groups so that motion within the same group is small enough to be neglected in coherence analysis. We then align successive frames in the same group according to the relative motions among them. After frame alignment, spatio-temporal slices are analyzed to detect regions of unnaturally high or abnormally low coherence across successive frames. Finally, a detection and localization process identifies the tampered regions.
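The grouping step just described can be sketched in a few lines. The code below is an illustrative reading of the idea, not the authors' implementation: successive-frame motion vectors are accumulated, and a new group starts whenever the accumulated drift from the group's first frame exceeds a threshold (the function name `group_frames` and the threshold `max_shift` are our own assumptions).

```python
def group_frames(motions, max_shift=2.0):
    """Partition frame indices 0..len(motions) into groups.

    motions[t] is the (dx, dy) camera motion between frame t and t+1.
    Within a group, the total drift from the group's first frame stays
    below max_shift pixels, so residual motion can be neglected during
    coherence analysis.
    """
    groups, current = [], [0]
    acc_x = acc_y = 0.0
    for t, (dx, dy) in enumerate(motions):
        acc_x += dx
        acc_y += dy
        if (acc_x ** 2 + acc_y ** 2) ** 0.5 > max_shift:
            groups.append(current)               # close the current group
            current, acc_x, acc_y = [], 0.0, 0.0
        current.append(t + 1)
    groups.append(current)
    return groups

# A static camera (zero motion) yields a single group:
print(group_frames([(0.0, 0.0)] * 4))            # [[0, 1, 2, 3, 4]]
# A steady 1 px/frame pan splits the sequence once the drift exceeds 2 px:
print(group_frames([(1.0, 0.0)] * 4))            # [[0, 1, 2], [3, 4]]
```

The key design point this sketch captures is that grouping bounds the residual motion per group, so later analysis can treat each group's frames as (approximately) spatially registered.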

We have carried out experiments over major inpainting methods, such as temporal copy-and-paste and exemplar-based texture synthesis, which are used to fill the holes left by the removal of unwanted objects from a video. The results show that the proposed approach outperforms the ADI approach proposed in Zhang et al. (2009), and can effectively detect and localize tampered areas in videos from either static or moving cameras. It should be noted that a preliminary version of the proposed approach was sketched in Lin and Tsay (2013). This paper is the complete version, which elaborates all details, carries out more extensive experiments, and makes significant improvements.

The remainder of this paper is organized as follows. Section Overview briefly overviews the problem, and sketches our main approach. Section Frame grouping and alignment gives the details of frame grouping and alignment. Section Spatio-temporal coherence analysis gives the details of spatio-temporal coherence analysis. Section Tampered slice detection and region localization gives the details of tampered slice detection and region localization. Section Experimental results gives experimental results, and improvement for compressed videos. Section Conclusion concludes.

Section snippets

Overview

In this section, we briefly overview the spatio-temporal slice technique, and sketch our proposed approach for detection and localization of tampered regions caused by object removal in video sequences.

The spatio-temporal slice technique is widely used in analyzing spatial and temporal relationships in video sequences (Adelson and Bergen, 1985), and is commonly used in various research areas, such as motion analysis and segmentation (Ngo et al., 2003), human gait analysis (Niyogi and Adelson,

Frame grouping and alignment

A video of length L is a sequence of L frames f1, …, fL. Each frame ft, t ∈ [1, L], is an image of M × N pixels. To deal with camera motion, we first compute camera motions between all pairs of successive frames. We then partition the video sequence into frame groups so that the camera motion within each group is smaller than a predefined threshold. Frames in the same group are then aligned to form a 3D volume for spatio-temporal coherence analysis.
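The alignment step can be illustrated as follows. This is a simplified stand-in, assuming integer per-frame shifts: each frame is resampled onto the reference frame's grid, with pixels that leave the field of view marked as missing (`None`). The helper name `align_frame` and the fill convention are ours, not the paper's.

```python
def align_frame(frame, dx, dy, fill=None):
    """Shift a frame (a list of pixel rows) by (dx, dy) pixels so that it
    overlays the group's reference frame; uncovered pixels get `fill`."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x + dx, y + dy          # source pixel in the unaligned frame
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = frame[sy][sx]
    return out

f = [[1, 2],
     [3, 4]]
print(align_frame(f, 1, 0))                  # [[2, None], [4, None]]
print(align_frame(f, 0, 0))                  # identity: [[1, 2], [3, 4]]
```

Stacking the aligned frames of one group then yields the 3D volume over which the spatio-temporal coherence analysis operates; a real implementation would use subpixel interpolation rather than integer shifts.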

The camera motion between successive frames ft-1

Spatio-temporal coherence analysis

Spatio-temporal coherence analysis is performed over each frame group independently. The objective is to produce a group coherence abnormality pattern (GCAP) for each frame group, which can be used in subsequent steps to identify regions with unnaturally high or abnormally low coherence. For simplicity, we assume the video sequence consists of only one frame group in the following discussion. Otherwise, we simply repeat the following analysis for each group.
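To make the idea concrete, the sketch below shows how a horizontal spatio-temporal slice could be taken from the aligned volume and turned into a per-pixel abnormality map: near-zero frame-to-frame change suggests unnaturally high coherence (as left by temporal copy-and-paste), while very large change suggests abnormally low coherence. The thresholds, function names, and the +1/−1/0 encoding are illustrative assumptions, not the paper's exact GCAP construction.

```python
def horizontal_slice(volume, y):
    """volume[t][y][x] -> slice[t][x]: row y of every frame, stacked over time."""
    return [frame[y] for frame in volume]

def abnormality(slice_tx, low=0.5, high=40.0):
    """Mark each (t, x) as +1 (suspiciously coherent), -1 (incoherent), or 0."""
    marks = []
    for t in range(1, len(slice_tx)):
        row = []
        for x in range(len(slice_tx[t])):
            d = abs(slice_tx[t][x] - slice_tx[t - 1][x])
            row.append(1 if d < low else (-1 if d > high else 0))
        marks.append(row)
    return marks

# Toy volume: 3 frames, each one row of 3 pixels. Column 0 drifts naturally,
# column 1 is frozen (copied), column 2 jumps once (a splice boundary).
v = [[[10, 10, 100]],
     [[12, 10, 10]],
     [[14, 10, 10]]]
s = horizontal_slice(v, 0)
print(abnormality(s))                        # [[0, 1, -1], [0, 1, 1]]
```

Aggregating such maps over all slices of a group would give a group-level abnormality pattern in the spirit of the GCAP described above.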

For each spatio-temporal slice, we

Tampered slice detection and region localization

In this section, we describe how to determine if a spatio-temporal slice contains tampered regions, and how to localize tampered regions in tampered slices.

Consider the spatio-temporal slices of the 3D volume formed by all frames in some frame group. Each spatio-temporal slice is compared to its group coherence abnormality pattern (GCAP) to determine whether it is tampered. In the following discussion, we explain how to identify tampered regions using the GCAP.
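A hedged sketch of this slice-level decision: each slice's abnormality map (SHM_i in the paper's notation) is compared against the GCAP, and a slice whose map departs from the group consensus too often is flagged. The disagreement-ratio test below is our own illustrative stand-in for the paper's sim/diff criterion, whose exact definition is elided in this excerpt.

```python
def flag_tampered(shm, gcap, max_diff_ratio=0.2):
    """Flag a slice whose abnormality map disagrees with the GCAP on more
    than max_diff_ratio of its entries."""
    total = diff = 0
    for row_s, row_g in zip(shm, gcap):
        for a, b in zip(row_s, row_g):
            total += 1
            diff += (a != b)
    return diff / total > max_diff_ratio

gcap = [[0, 0, 0], [0, 0, 0]]                        # group consensus: all normal
print(flag_tampered([[0, 1, 1], [0, 1, 1]], gcap))   # True  (4/6 entries disagree)
print(flag_tampered([[0, 0, 0], [0, 0, 1]], gcap))   # False (1/6 entries disagree)
```

Localization would then follow from the positions of the disagreeing entries within a flagged slice, since each entry maps back to a (frame, pixel) coordinate in the volume.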

Let sim(SHMi, GCAP) and diff(SHMi, GCAP) be

Experimental results

In this section, we measure the performance of our approach over 18 test videos, and present a comparison with the ADI approach proposed in Zhang et al. (2009). We further give an improvement of our approach for videos that are saved in compressed formats after being tampered.

Discussion and remarks

In real circumstances, object removal from the scene is one of the major attacks in applications such as video surveillance. Furthermore, to create plausible tampered videos, TCP and ETS are the common techniques used for filling the holes left by the removal of unwanted objects. How to detect and localize tampered regions caused by object removal is an important task for digital forensic practitioners. Although a number of approaches have been proposed, most of them work for frame-level detection only and cannot localize region-level forgery.

Conclusion

In this paper, we have investigated two common region-level inpainting methods, temporal copy-and-paste and exemplar-based texture synthesis, for filling the holes left by object removal, and have proposed a new approach based on spatio-temporal coherence analysis for passive detection and localization of tampered regions. Our approach handles camera motion by frame grouping and alignment. Experiments show that our approach can effectively detect and localize regions manipulated by temporal copy-and-paste and exemplar-based texture synthesis.

References (39)

  • H. Guo et al. A fragile watermarking scheme for detecting malicious modifications of database relations. Inf Sci (2006)
  • B. Mahdian et al. A bibliography on blind methods for identifying image forgery. Signal Process Image Commun (2010)
  • E.H. Adelson et al. Spatiotemporal energy models for the perception of motion. J Opt Soc Am Opt Image Sci (1985)
  • A. Criminisi et al. Region filling and object removal by exemplar-based image inpainting. IEEE Trans Image Process (2004)
  • I.J. Cox et al. Digital watermarking and fundamentals (2002)
  • V. Conotter et al. Exposing digital forgeries in ballistic motion. IEEE Trans Inf Forensics Secur (2012)
  • H. Farid. A survey of image forgery detection. IEEE Signal Process Mag (2009)
  • S. Goferman et al. Context-aware saliency detection
  • R.C. Gonzalez et al. Digital image processing (2008)
  • J. Hays et al. Scene completion using millions of photographs. ACM Trans Graph (2007)
  • C.-C. Hsu et al. Video forgery detection using correlation of noise residue. Proc. of IEEE Int. Conf. on Multimedia Signal Processing (2007)
  • W.V. Houten et al. Source video camera identification for multiply compressed videos originating from YouTube. Digit Investig (2009)
  • N. Komodakis et al. Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans Image Process (2007)
  • V. Kwatra et al. Graphcut textures: image and video synthesis using graph cuts. ACM Trans Graph (2003)
  • M. Kirchner et al. Impeding forgers at photo inception
  • M. Kobayashi et al. Detecting forgery from static-scene video based on inconsistency in noise level functions. IEEE Trans Inf Forensics Secur (2010)
  • C.-H. Ling et al. Virtual contour guided video object inpainting using posture mapping and retrieval. IEEE Trans Multimedia (2011)
  • G.-S. Lin et al. Detecting frame duplication based on spatial and temporal analyses
  • C.-S. Lin et al. Passive approach for video forgery detection and localization