Optical flow estimation using channel attention mechanism and dilated convolutional neural networks
Introduction
Optical flow estimation is a classical computer vision problem concerned with estimating pixel-level motion fields from two adjacent images. Traditional methods [1], [2], [3], [4], [5] usually build an energy function from prior knowledge, such as the brightness constancy and spatial smoothness assumptions, so we can regard them as knowledge-driven methods. In these works, optical flow is estimated by minimizing the pre-defined energy function. However, knowledge-driven approaches rely only on prior constraints to capture the relationship between images and flow. Moreover, these methods cannot learn weights from large amounts of data, and most of them are too time-consuming for practical applications.
Recently, convolutional neural networks have made rapid progress in many computer vision tasks, such as image classification [6], object recognition [7], semantic segmentation [8], depth estimation [9], and person re-identification [10]. Learning optical flow with convolutional neural networks was first proposed by Dosovitskiy et al. [11], who designed a novel network named FlowNet based on an encoder-decoder architecture; they simultaneously published a synthetic dataset named FlyingChairs for training and testing. Building on [11], Ilg et al. designed a larger network named FlowNet2.0 [12], which stacks several sub-networks and uses a warping operation between them for iterative refinement. Additionally, FlowNet2.0 uses the FlyingThings3D dataset published in [13] to fine-tune each sub-network. Although FlowNet2.0 achieves high accuracy on several benchmarks, its training process is complicated and memory-intensive; moreover, its model size is about five times larger than that of [11]. Another approach, SpyNet [14], combines a spatial pyramid network with optical flow estimation: it warps the second image toward the first using the upsampled flow and computes the incremental flow at each pyramid level. Compared to [11] and [12], SpyNet has fewer parameters and runs faster; however, it cannot compete with [12] in accuracy, and its EPE is only slightly lower than that of [11].
Existing networks treat all channel-wise features equally, lacking the ability to distinguish their relative importance, which impedes the representational ability of the network. Moreover, current approaches usually use strided convolutions and deconvolutional layers to reduce or enlarge the size of feature maps, which causes a loss of spatial information and further limits dense estimation tasks.
To address the problem of feature recalibration, we introduce a channel attention unit and dilated convolution into the network for optical flow estimation. Recent work [15] introduced a channel-wise attention network for image classification, which recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. In our network, we adopt a channel attention module to adaptively learn more useful channel-wise features. To mitigate the loss of spatial information and the limited receptive field, we introduce dilated convolution into our network. Dilated convolution is widely used in pixel-level tasks, such as semantic segmentation and image super-resolution, where it is employed to preserve the resolution of feature maps. It has two advantages: (1) it enlarges the receptive field of the convolutional kernel, and (2) the resolution of the feature map can be kept without large computational cost. Owing to these advantages, we employ dilated convolution in our network to produce sharper flow fields.
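As a rough sketch of how such recalibration works, the squeeze-and-excitation style attention of [15] can be written in a few lines of NumPy; the function name, weight shapes, and bottleneck ratio below are illustrative assumptions, not the exact unit used in this paper:

```python
import numpy as np

def channel_attention(features, w1, w2):
    """SE-style channel attention sketch.
    features: (C, H, W); w1: (C // r, C); w2: (C, C // r) for a bottleneck ratio r."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = features.mean(axis=(1, 2))                 # (C,)
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating in (0, 1).
    s = np.maximum(w1 @ z, 0.0)                    # (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ s)))         # (C,)
    # Recalibration: rescale each channel by its learned importance.
    return features * gate[:, None, None]
```

In a real network the weights w1 and w2 are learned jointly with the convolutional layers, so informative channels receive gates close to 1 while uninformative ones are suppressed.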
In this paper, we propose a novel network for optical flow estimation, which introduces channel attention mechanism and dilated convolution into learning optical flow. Our contributions are summarized as follows:
- 1. To adaptively make the network focus on more informative features and learn more useful channel-wise features, we exploit the interdependencies among feature channels and embed a channel attention unit into our network, which enhances the representational ability of the deep convolutional neural network. To the best of our knowledge, we are the first to combine a channel attention mechanism with learning optical flow.
- 2. In order to enlarge the receptive field without increasing the filter size and to exploit more spatial information efficiently, we introduce dilated convolution into our network, which was proven effective for optical flow estimation in our earlier work [16]. In this paper, we further design a cascaded attention and dilated convolution module that effectively improves the accuracy of flow estimation.
In addition, we employ prior multi-constraint loss proposed in our previous work [17] to further improve the accuracy, which combines the supervised term with prior constraints used in knowledge-driven methods.
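To make the receptive-field argument concrete, the following minimal 1-D "valid" dilated convolution (an illustrative sketch, not the actual layers of our network) shows that a kernel of size k with dilation d covers (k - 1) * d + 1 input samples:

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D dilated cross-correlation; returns (outputs, receptive_field)."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # effective receptive field of one output
    taps = np.arange(k) * dilation         # read every `dilation`-th input sample
    n_out = len(signal) - span + 1
    out = np.array([signal[i + taps] @ kernel for i in range(n_out)])
    return out, span
```

With k = 3, dilation 1 covers 3 samples while dilation 2 covers 5: the receptive field grows without adding parameters or reducing resolution, which is why dilation suits dense prediction tasks.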
Section snippets
Related work
In Section 2-A, we first briefly introduce knowledge-driven methods. In Section 2-B, we mainly discuss data-driven methods. In Section 2-C, we review computer vision tasks that use attention mechanisms. In Section 2-D, we describe dilated-convolution-based methods proposed for other computer vision tasks.
Network architecture
The entire architecture of our network, which is based on an encoder-decoder design, is shown in Fig. 1. Fig. 2 shows the contracting part of the proposed network. As shown in Fig. 2, the two adjacent images I1 and I2 are first fed into a feature extractor that contains three standard convolutional layers, each followed by a ReLU and using a stride of 2; the convolutional kernel sizes are 7×7, 5×5 and 3×3. The outputs of the feature extractor are F1 and F2, respectively. Given two adjacent images,
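As a sanity check on the contracting part, the spatial resolution after the three stride-2 layers can be computed with the standard convolution output-size formula; the "same"-style padding of k // 2 assumed below is not stated in the snippet and is only an illustrative choice:

```python
def conv_out_size(size, kernel, stride, pad):
    """Standard output size of one convolutional layer along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

def extractor_out_size(h, w):
    """Resolution after three stride-2 convolutions with kernels 7x7, 5x5, 3x3
    and assumed padding k // 2; the overall downsampling factor is 8."""
    for k in (7, 5, 3):
        h = conv_out_size(h, k, stride=2, pad=k // 2)
        w = conv_out_size(w, k, stride=2, pad=k // 2)
    return h, w
```

For example, a 384×512 input yields 48×64 feature maps, i.e. F1 and F2 are at 1/8 of the input resolution.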
Experimental results
In this section, we describe the training process and evaluate our method on the MPI-Sintel and KITTI datasets. We compare our method with both knowledge-driven and data-driven methods, and the experimental results verify the effectiveness of the proposed approach.
Conclusion
In this paper, we propose a novel network for optical flow estimation, which introduces a channel attention module and dilated convolution into learning optical flow. The channel attention module can adaptively recalibrate channel-wise features by considering the relationships among channels, further improving the representational ability of the network. Moreover, the network can learn the weights of the feature maps and focus on more useful features. In addition, for dense estimation task, the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (No. 61401113), the Natural Science Foundation of Heilongjiang Province of China (No. LC201426), the Fundamental Research Funds for the Central Universities of China (No. 3072019CF0801), and the Ph.D. Student Research and Innovation Fund of the Fundamental Research Funds for the Central Universities (No. 3072019GIP0807).
References (43)
- et al., Determining optical flow, Artif. Intell. (1981)
- et al., Coarse to over-fine optical flow estimation, Pattern Recognit. (2007)
- et al., Variational method for joint optical flow estimation and edge-aware image restoration, Pattern Recognit. (2017)
- et al., Multi-modal self-paced learning for image classification, Neurocomputing (2018)
- et al., Multi-scale pyramid pooling network for salient object detection, Neurocomputing (2019)
- et al., Multimodality semantic segmentation based on polarization and color images, Neurocomputing (2017)
- et al., A jointly learned deep embedding for person re-identification, Neurocomputing (2019)
- et al., FlowNet: learning optical flow with convolutional networks, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (2015)
- et al., Image super-resolution using very deep residual channel attention networks, Proceedings of Computer Vision – ECCV 2018 (2018)
- et al., Secrets of optical flow estimation and their principles, Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2010)
- Large displacement optical flow: descriptor matching in variational motion estimation, IEEE Trans. Pattern Anal. Mach. Intell.
- Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference, Pattern Recognit.
- FlowNet 2.0: evolution of optical flow estimation with deep networks, Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Optical flow estimation using a spatial pyramid network, Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- Squeeze-and-excitation networks, Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- Learning optical flow using deep dilated residual networks, IEEE Access
- Deep optical flow supervised learning with prior assumptions, IEEE Access
- Joint coarse-and-fine reasoning for deep optical flow, Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP)
- Unsupervised convolutional neural networks for motion estimation, Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP)
- Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness, Proceedings of the Workshops on Computer Vision – ECCV 2016
Mingliang Zhai was born in Xining, China, in 1994. He received the B.Eng. degree from Jilin University, China, in 2016. His research interests include image processing, computer vision and pattern recognition. He is currently pursuing a doctorate at Harbin Engineering University.
Xuezhi Xiang was born in Harbin, China, in 1979. He received the B.Eng. degree in information engineering, and the M.Sc. and Ph.D. degrees in signal and information processing from Harbin Engineering University, China, in 2002, 2004, and 2008, respectively. He was a Post-Doctoral Fellow with the Harbin Institute of Technology from 2009 to 2011. From 2011 to 2012, he was a Visiting Scholar with the University of Ottawa. Since 2010, he has been an Associate Professor with the School of Information and Communication Engineering, Harbin Engineering University. He has authored over 40 articles. His research interests include image processing, computer vision, and pattern recognition, etc. Dr. Xiang is also a member of the Association for Computing Machinery and a Senior Member of the China Computer Federation.
Rongfang Zhang was born in Daqing, China, in 1993. She received the B.Eng. degree in communication engineering from Harbin Engineering University, China, in 2017. Her research interests include image processing, computer vision, and pattern recognition.
Ning Lv was born in Yingkou, China, in 1994. She received the B.Eng. degree in communication engineering from Shandong University, China, in 2016. Her research interests include computer vision and pattern recognition.
Abdulmotaleb El Saddik (F'09) is Distinguished University Professor and University Research Chair in the School of Electrical Engineering and Computer Science at the University of Ottawa. His research focus is on multimodal interactions with sensory information in smart cities. He is Senior Associate Editor of, among others, the ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMCCAP, currently TOMM) and the IEEE Transactions on Multimedia (IEEE TMM), and Guest Editor for several IEEE Transactions and Journals. He has authored and co-authored four books and more than 550 publications and chaired more than 50 conferences and workshops. He has received research grants and contracts totalling more than $18 M. He has supervised more than 120 researchers and received several international awards, among them ACM Distinguished Scientist, Fellow of the Engineering Institute of Canada, Fellow of the Canadian Academy of Engineers, Fellow of IEEE, the IEEE I&M Technical Achievement Award, and the IEEE Canada Computer Medal.