Elsevier

Neurocomputing

Volume 368, 27 November 2019, Pages 124-132

Optical flow estimation using channel attention mechanism and dilated convolutional neural networks

https://doi.org/10.1016/j.neucom.2019.08.040

Abstract

Learning optical flow with convolutional neural networks has made great progress in recent years. These approaches usually design an encoder-decoder network that can be trained end-to-end. In the encoder, high-level feature information is extracted through a series of strided convolutions, similar to most image classification networks. In contrast to the classification task, the spatial feature maps are then enlarged to the full scale of the input by successive deconvolution layers in the decoder. However, optical flow estimation is a pixel-level task, and blurry flow fields are often generated owing to unrefined features and low resolution. To address this problem, we propose a novel network that combines an attention mechanism with a dilated convolutional neural network. In this network, channel-wise features are adaptively weighted by modelling interdependencies among channels, which weakens the weights of useless features and enhances the directivity of feature extraction. Meanwhile, spatial precision is preserved by employing dilated convolution, which enlarges the receptive field without a large computational cost and keeps the spatial resolution of the feature maps unchanged. Our network is trained on the FlyingChairs and FlyingThings3D datasets in a supervised manner. Extensive experiments are conducted on the MPI-Sintel and KITTI datasets to verify the effectiveness of the proposed method. The experimental results show that the attention mechanism and dilated convolution are beneficial for optical flow estimation. Moreover, our method achieves better accuracy and visual quality than most recent approaches.

Introduction

Optical flow estimation is a classical computer vision problem concerned with estimating pixel-level motion fields from two adjacent images. Traditional methods [1], [2], [3], [4], [5] usually build an energy function from prior knowledge, such as brightness constancy and spatial smoothness assumptions. We can regard these methods as knowledge-driven methods. In these works, optical flow is estimated by minimizing the pre-defined energy function. However, knowledge-driven approaches use only prior constraints to capture the relationship between images and flow. Moreover, these methods cannot learn weights from large amounts of data, and most of them are too time-consuming for real applications.
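As an illustration, a Horn-Schunck-style energy (a generic example of such knowledge-driven formulations, not necessarily the exact form used by the cited methods) combines a brightness-constancy data term with a spatial smoothness prior over the flow field (u, v):

```latex
E(u, v) = \int_{\Omega}
    \underbrace{\bigl( I_2(x + u, y + v) - I_1(x, y) \bigr)^2}_{\text{brightness constancy}}
  + \alpha \underbrace{\bigl( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \bigr)}_{\text{spatial smoothness}}
  \, dx \, dy
```

Minimizing E(u, v) over the flow field yields the estimate; the weight α trades off data fidelity against smoothness.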

Recently, convolutional neural networks have made rapid progress in many computer vision tasks, such as image classification [6], object recognition [7], semantic segmentation [8], depth estimation [9], and person re-identification [10]. Learning optical flow with convolutional neural networks was first proposed by Dosovitskiy et al. [11], who designed a novel network named FlowNet based on an encoder-decoder architecture. Simultaneously, a synthetic dataset named FlyingChairs was published for training and testing. Building on [11], Ilg et al. designed a larger network named FlowNet2.0 [12], which connects several sub-networks and applies a warping operation between them for iterative refinement. Additionally, FlowNet2.0 uses the FlyingThings3D dataset published in [13] to fine-tune each sub-network. Although FlowNet2.0 achieves high accuracy on several benchmarks, its training process is complicated and memory-consuming. Moreover, the model of [12] is about 5 times larger than that of [11]. Another approach, SpyNet [14], combines a spatial pyramid network with optical flow estimation: it warps the second image toward the first using the upsampled flow and computes the incremental flow at each pyramid level. Compared to [11] and [12], [14] has fewer parameters and is faster. However, [14] cannot compete with [12] in accuracy, and its endpoint error (EPE) is only slightly lower than that of [11].
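The warping operation used by [12] and [14] can be sketched as backward bilinear sampling: the second image is sampled at positions displaced by the (upsampled) flow. A minimal NumPy sketch, where the function name and the flow convention `flow[y, x] = (dx, dy)` are our own assumptions:

```python
import numpy as np

def warp_backward(img2, flow):
    """Backward-warp img2 toward frame 1 with bilinear sampling.

    flow has shape (H, W, 2); flow[y, x] = (dx, dy) is the displacement
    of pixel (x, y) from frame 1 to frame 2 (an assumed convention).
    Sample coordinates are clamped at the image border.
    """
    h, w = img2.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Positions in frame 2 at which frame 1's pixels are sampled.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    # Integer corners and fractional weights for bilinear interpolation.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    top = img2[y0, x0] * (1 - wx) + img2[y0, x1] * wx
    bot = img2[y1, x0] * (1 - wx) + img2[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With zero flow the warp is the identity, which is a convenient sanity check; in [14] this operation is applied at every pyramid level so only a small incremental flow remains to be estimated.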

Existing networks treat all feature channels equally; they lack the ability to distinguish the importance of channel-wise features, which impedes the representational ability of the networks. Moreover, current approaches usually use strided convolutions and deconvolutional layers to reduce or enlarge the feature size, which causes a loss of spatial information and further limits dense estimation tasks.

To address the problem of feature recalibration, we introduce a channel attention unit and dilated convolution into a network for optical flow estimation. Recent work [15] introduced a channel-wise attention network for image classification that recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. In our network, we adopt a channel attention module to adaptively learn more useful channel-wise features. To address the loss of spatial information and the limited receptive field, we introduce dilated convolution into our network. Dilated convolution is widely used in pixel-level tasks such as semantic segmentation and image super-resolution, where it is employed to preserve the resolution of the features. It has two advantages: (1) it enlarges the receptive field of the convolutional kernel, and (2) it keeps the resolution of the feature map without a large computational cost. Owing to these advantages, we employ dilated convolution in our network to produce sharper flow fields.
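The channel attention of [15] follows a squeeze-and-excitation pattern: global average pooling per channel, a bottleneck of two fully connected layers, and a sigmoid gate that rescales each channel. A minimal NumPy sketch, with weight shapes and the function name chosen by us for illustration:

```python
import numpy as np

def channel_attention(feat, w1, b1, w2, b2):
    """SE-style channel attention on a (C, H, W) feature map.

    Squeeze: global average pooling per channel.
    Excite:  two fully connected layers (reduce with w1, expand with w2),
             ReLU in between, sigmoid gate at the end.
    Scale:   reweight each channel by its learned gate in (0, 1).
    """
    z = feat.mean(axis=(1, 2))                  # squeeze: (C,)
    h = np.maximum(w1 @ z + b1, 0.0)            # reduction + ReLU: (C/r,)
    g = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # expansion + sigmoid: (C,)
    return feat * g[:, None, None]              # per-channel rescaling
```

Channels whose gate is driven toward 0 are suppressed and channels near 1 pass through, which is how the unit "weakens the weights of useless features" described above.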

In this paper, we propose a novel network for optical flow estimation, which introduces channel attention mechanism and dilated convolution into learning optical flow. Our contributions are summarized as follows:

  • 1.

To adaptively make the network focus on more informative features and learn more useful channel-wise features, we exploit the interdependencies among feature channels and embed a channel attention unit into our network, which enhances the representational ability of the deep convolutional neural network. To the best of our knowledge, we are the first to combine a channel attention mechanism with learning optical flow.

  • 2.

In order to enlarge the receptive field without increasing the filter size and to exploit more spatial information efficiently, we introduce dilated convolution into our network, which was proven effective for optical flow estimation in our earlier work [16]. In this paper, we further design a cascaded attention and dilated convolution module that effectively improves the accuracy of flow estimation.

In addition, we employ the prior multi-constraint loss proposed in our previous work [17] to further improve accuracy; it combines the supervised term with the prior constraints used in knowledge-driven methods.
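The dilated convolution named in contribution 2 can be sketched as follows: the kernel taps are spread apart by the dilation rate, so a k × k kernel covers an effective field of (k − 1) · r + 1 pixels while the output keeps the input resolution. A minimal NumPy sketch (single channel, odd kernel, zero "same" padding; all names are ours):

```python
import numpy as np

def dilated_conv2d(x, k, rate):
    """2-D dilated (atrous) convolution with zero 'same' padding.

    x: (H, W) input, k: (kh, kw) kernel with odd sizes, rate: dilation.
    The output has the same spatial size as the input; the effective
    receptive field of the kernel is (kh - 1) * rate + 1.
    """
    kh, kw = k.shape
    ph, pw = (kh - 1) * rate // 2, (kw - 1) * rate // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    # Each kernel tap reads the input shifted by a multiple of the rate.
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i * rate : i * rate + x.shape[0],
                                j * rate : j * rate + x.shape[1]]
    return out
```

At rate 1 this reduces to an ordinary convolution; at rate 2 a 3 × 3 kernel already sees a 5 × 5 neighbourhood at the same cost, which is the trade-off the text refers to.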


Related work

In Section 2-A, we first briefly introduce knowledge-driven methods. In Section 2-B, we mainly discuss data-driven methods. In Section 2-C, we review computer vision tasks that use the attention mechanism. In Section 2-D, we describe dilated-convolution-based methods proposed for various computer vision tasks.

Network architecture

The overall architecture of our network is shown in Fig. 1; it is based on an encoder-decoder design. Fig. 2 shows the contracting part of the proposed network. As shown in Fig. 2, the two adjacent images I1 and I2 are first fed into a feature extractor that contains three standard convolutional layers with stride 2, each followed by a ReLU. The convolutional kernel sizes are set to 7 × 7, 5 × 5, and 3 × 3, and the outputs of the feature extractor are F1 and F2, respectively. Given two adjacent images,
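The downsampling produced by the extractor described above follows the standard convolution output-size formula. A small sketch; the 'same'-style padding p = k // 2 and the input size are our assumptions, since the text only states the kernel sizes and the stride:

```python
def conv_out_size(n, k, s, p):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Three stride-2 layers with the stated kernel sizes (7x7, 5x5, 3x3).
size = 384  # an illustrative input width, e.g. a training crop
for k in (7, 5, 3):
    size = conv_out_size(size, k, 2, k // 2)
# size is now 48, i.e. the extractor downsamples by a factor of 8
```

Each stride-2 layer halves the resolution, so F1 and F2 are at 1/8 of the input scale under these assumptions; this is the resolution loss that the dilated layers later in the network are meant to avoid compounding.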

Experimental results

In this section, we describe the training process and evaluate our method on the MPI-Sintel and KITTI datasets. We compare our method with both knowledge-driven and data-driven methods. The experimental results verify the effectiveness of the proposed approach.

Conclusion

In this paper, we propose a novel network for optical flow estimation, which introduces a channel attention module and dilated convolution into learning optical flow. The channel attention module adaptively recalibrates channel-wise features by considering the relationships among channels, further improving the representational ability of the network. Moreover, the network can learn the weights of the feature maps and focus on more useful features. In addition, for the dense estimation task, the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 61401113), the Natural Science Foundation of Heilongjiang Province of China (No. LC201426), the Fundamental Research Funds for the Central Universities of China (No. 3072019CF0801), and the Ph.D. Student Research and Innovation Fund of the Fundamental Research Funds for the Central Universities (No. 3072019GIP0807).


References (43)

  • T. Brox et al.

    Large displacement optical flow: descriptor matching in variational motion estimation

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • B. Li et al.

    Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference

    Pattern Recognit.

    (2018)
  • E. Ilg et al.

    FlowNet 2.0: evolution of optical flow estimation with deep networks

    Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • N. Mayer et al.

    A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

    Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2016)
  • A. Ranjan et al.

    Optical flow estimation using a spatial pyramid network

    Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2017)
  • J. Hu et al.

    Squeeze-and-excitation networks

    Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • M. Zhai et al.

    Learning optical flow using deep dilated residual networks

    IEEE Access

    (2019)
  • X. Xiang et al.

    Deep optical flow supervised learning with prior assumptions

    IEEE Access

    (2018)
  • V. Vaquero et al.

    Joint coarse-and-fine reasoning for deep optical flow

    Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP)

    (2017)
  • A. Ahmadi et al.

    Unsupervised convolutional neural networks for motion estimation

    Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP)

    (2016)
  • J.J. Yu et al.

    Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness

    Proceedings of the Workshops on Computer Vision – ECCV 2016

    (2016)

    Mingliang Zhai was born in Xining, China, in 1994. He received the B.Eng. degree from the JiLin University, China, in 2016. His research interests include image processing, computer vision and pattern recognition. He is currently pursuing a doctorate at Harbin Engineering University.

    Xuezhi Xiang was born in Harbin, China, in 1979. He received the B.Eng. degree in information engineering, and the M.Sc. and Ph.D. degrees in signal and information processing from Harbin Engineering University, China, in 2002, 2004, and 2008, respectively. He was a Post-Doctoral Fellow with the Harbin Institute of Technology from 2009 to 2011. From 2011 to 2012, he was a Visiting Scholar with the University of Ottawa. Since 2010, he has been an Associate Professor with the School of Information and Communication Engineering, Harbin Engineering University. He has authored over 40 articles. His research interests include image processing, computer vision, and pattern recognition, etc. Dr. Xiang is also a member of the Association for Computing Machinery and a Senior Member of the China Computer Federation.

    Rongfang Zhang was born in Daqing, China, in 1993. She received the B.Eng. degree in communication engineering from Harbin engineering University, China, in 2017. Her research interests include image processing, computer vision, and pattern recognition.

    Ning Lv was born in Yingkou, China, in 1994. She received the B.Eng. degree in communication engineering from Shandong University, China, in 2016. Her research interests include computer vision and pattern recognition.

    Abdulmotaleb El Saddik (F'09) is Distinguished University Professor and University Research Chair in the School of Electrical Engineering and Computer Science at the University of Ottawa. His research focus is on multimodal interactions with sensory information in smart cities. He is Senior Associate Editor of, among others, the ACM Transactions on Multimedia Computing, Communications and Applications (ACM TOMCCAP, currently TOMM) and the IEEE Transactions on Multimedia (IEEE TMM), and Guest Editor for several IEEE Transactions and Journals. He has authored and co-authored four books and more than 550 publications and has chaired more than 50 conferences and workshops. He has received research grants and contracts totalling more than $18 M. He has supervised more than 120 researchers and received several international awards, among them ACM Distinguished Scientist, Fellow of the Engineering Institute of Canada, Fellow of the Canadian Academy of Engineering, Fellow of IEEE, the IEEE I&M Technical Achievement Award, and the IEEE Canada Computer Medal.
