ABSTRACT
As a vital topic in media content interpretation, video anomaly detection (VAD) has made fruitful progress with deep neural networks (DNNs). However, existing methods usually follow a reconstruction or frame prediction routine. They suffer from two gaps: (1) They cannot localize video activities both precisely and comprehensively. (2) They lack sufficient ability to utilize high-level semantics and temporal context information. Inspired by the cloze test frequently used in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge the gaps above. First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. Appearance and motion are exploited as mutually complementary cues to localize regions of interest (RoIs). A normalized spatio-temporal cube (STC) is built from each RoI as a video event, which lays the foundation of VEC and serves as its basic processing unit. Second, we encourage the DNN to capture high-level semantics by solving a visual cloze test. To build such a visual cloze test, a certain patch of the STC is erased to yield an incomplete event (IE). The DNN learns to restore the original video event from the IE by inferring the missing patch. Third, to incorporate richer motion dynamics, another DNN is trained to infer the erased patch's optical flow. Finally, two ensemble strategies using different types of IE and modalities are proposed to boost VAD performance, so as to fully exploit the temporal context and modality information for VAD. VEC consistently outperforms state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly used VAD benchmarks. Our code and results can be verified at github.com/yuguangnudt/VEC_VAD
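The visual cloze test described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes an STC is stored as a NumPy array of shape (T, H, W, C), that a "patch" is one temporal slice of that array, that completion error is measured by mean squared error, and that scores from different IE types are combined by simple averaging. The function names and the `mean_patch_baseline` stand-in for the trained DNN are hypothetical.

```python
import numpy as np


def make_incomplete_event(stc, erase_idx):
    """Erase one temporal patch from an STC of shape (T, H, W, C).

    Returns the incomplete event (remaining T-1 patches) and the erased patch.
    """
    keep = [t for t in range(stc.shape[0]) if t != erase_idx]
    return stc[keep], stc[erase_idx]


def completion_error(predicted_patch, true_patch):
    """Mean squared error between the inferred patch and the erased patch."""
    return float(np.mean((predicted_patch - true_patch) ** 2))


def ensemble_score(stc, complete_fn):
    """Average completion error over every possible erased position.

    `complete_fn` maps an incomplete event to a predicted patch; in VEC this
    role is played by a trained DNN, which is only assumed here.
    """
    errors = []
    for t in range(stc.shape[0]):
        incomplete_event, target = make_incomplete_event(stc, t)
        errors.append(completion_error(complete_fn(incomplete_event), target))
    return float(np.mean(errors))


def mean_patch_baseline(incomplete_event):
    """Hypothetical stand-in for the DNN: predict the mean of the kept patches."""
    return incomplete_event.mean(axis=0)


if __name__ == "__main__":
    normal_event = np.zeros((5, 4, 4, 3))   # static content, easy to complete
    odd_event = normal_event.copy()
    odd_event[2] = 1.0                       # one patch deviates from its context
    print(ensemble_score(normal_event, mean_patch_baseline))
    print(ensemble_score(odd_event, mean_patch_baseline))
```

Under these assumptions, an event whose erased patch is hard to infer from its temporal context receives a higher score, which is the intuition behind using completion error as an anomaly indicator. The same scoring loop could be applied to an optical-flow target instead of the raw patch to cover the second modality.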
Supplemental Material
The supplementary PDF presents material that could not be placed in the main body due to the 8-page limit: 1. Frame-level Equal Error Rate (EER) comparison. 2. VEC's completion error histograms. 3. VEC's computational cost. 4. VEC's parameter sensitivity analysis. 5. More visualization results of VEC.
Index Terms
- Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events