ABSTRACT
Predicting the walking path of a pedestrian in a crowd is a pivotal step towards understanding their behavior. It is a recently emerging computer vision task that has scarcely been addressed to date. In this paper, we put forth a deep spatio-temporal learning-forecasting approach composed of two modules. First, displacement information is extracted from the pedestrians' walking histories and fed into a convolutional layer, which learns the underlying motion patterns and produces high-level representations. Second, unlike the mainstream literature, which learns either the temporal or the spatial dynamics among pedestrians separately, we embed both components into a single framework via a Long Short-Term Memory (LSTM) based architecture that takes the previously extracted high-level motion cues as input and outputs the potential future walking routes of all pedestrians in one shot. We evaluate our approach on three large benchmark datasets and show that it improves by a large margin over recent works in the literature, in both short- and long-term forecasting scenarios.
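The first module described in the abstract (displacement extraction followed by a convolution over time) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example track, and the two-step averaging kernel are hypothetical placeholders, and a real convolutional layer would learn its weights and feed its output to the LSTM stage, which is omitted here.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def displacements(track: List[Point]) -> List[Point]:
    """Frame-to-frame displacement (dx, dy) for one pedestrian's track."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]


def temporal_conv(seq: List[Point], kernel: List[Point]) -> List[float]:
    """Valid 1-D convolution (cross-correlation) along the time axis.

    Each output sums kernel[k] . seq[t + k] over the window.
    The kernel is a hand-picked placeholder, not learned weights.
    """
    width = len(kernel)
    return [
        sum(kx * sx + ky * sy
            for (kx, ky), (sx, sy) in zip(kernel, seq[t:t + width]))
        for t in range(len(seq) - width + 1)
    ]


# Hypothetical track: a pedestrian walking right while drifting upward.
track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5), (3.0, 1.5), (4.0, 3.0)]
disp = displacements(track)

# Assumed kernel: averages dx and dy over two consecutive time steps.
kernel = [(0.5, 0.5), (0.5, 0.5)]
features = temporal_conv(disp, kernel)

print(disp)      # per-step displacements
print(features)  # one motion feature per two-step window
```

Working on displacements rather than absolute positions makes the features independent of where in the scene the pedestrian is, which is one plausible reason for feeding displacement information to the convolutional layer.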
Index Terms
- Pedestrian Path Forecasting in Crowd: A Deep Spatio-Temporal Perspective