2023 | OriginalPaper | Chapter

CAENet: Efficient Multi-task Learning for Joint Semantic Segmentation and Depth Estimation

Authors : Luxi Wang, Yingming Li

Published in: Machine Learning and Knowledge Discovery in Databases: Research Track

Publisher: Springer Nature Switzerland

Abstract

In this paper, we propose an efficient multi-task method, named Context-aware Attentive Enrichment Network (CAENet), for real-time joint semantic segmentation and depth estimation. Building upon a light-weight encoder backbone, an efficient decoder is devised to fully leverage the information available in multi-scale encoder features. In particular, a new Inception Residual Pooling (IRP) module is designed to efficiently extract contextual information from high-level features with diverse receptive fields, improving semantic understanding. The context-aware features are then adaptively enriched with spatial details from low-level features via a Light-weight Attentive Fusion (LAF) module that uses a pseudo-stereoscopic attention mechanism. These two modules are applied progressively in a recursive manner to generate high-resolution shared features, which are further processed by task-specific heads to produce the final outputs. This network design effectively captures information beneficial to both semantic segmentation and depth estimation while largely reducing the computational budget. Extensive experiments across multi-task benchmarks validate that CAENet achieves state-of-the-art performance at inference speeds comparable to those of competing real-time methods. Code is available at https://github.com/wlx-zju/CAENet.
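The data flow described in the abstract can be illustrated with a shape-only sketch. This is not the authors' implementation (see the linked repository for that): the functions `irp` and `laf` are hypothetical stand-ins that only track tensor shapes, and the encoder strides, channel counts, decoder width, class count, and the choice to apply IRP again after every fusion step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Shape of a feature map: (channels, height, width)."""
    channels: int
    height: int
    width: int


def irp(x: Shape, out_channels: int) -> Shape:
    # Stand-in for Inception Residual Pooling: context extraction that is
    # ASSUMED to keep the spatial size and project to out_channels.
    return Shape(out_channels, x.height, x.width)


def laf(context: Shape, detail: Shape) -> Shape:
    # Stand-in for Light-weight Attentive Fusion: the context features are
    # ASSUMED to be upsampled to the resolution of the low-level features
    # and fused without changing the channel count.
    return Shape(context.channels, detail.height, detail.width)


def decode(encoder_feats, width=64, num_classes=40):
    """Walk encoder features from high-level to low-level.

    encoder_feats is ordered from low-level (high resolution) to
    high-level (low resolution). width and num_classes are hypothetical.
    """
    shared = irp(encoder_feats[-1], width)
    for detail in reversed(encoder_feats[:-1]):
        shared = laf(shared, detail)   # enrich context with spatial detail
        shared = irp(shared, width)    # assumed: re-extract context per stage
    # Task-specific heads operate on the shared high-resolution features.
    seg = Shape(num_classes, shared.height, shared.width)
    depth = Shape(1, shared.height, shared.width)
    return shared, seg, depth


# Hypothetical encoder outputs for a 480x640 input at strides 4, 8, 16, 32.
feats = [
    Shape(24, 120, 160),
    Shape(32, 60, 80),
    Shape(96, 30, 40),
    Shape(320, 15, 20),
]
shared, seg, depth = decode(feats)
print(shared, seg, depth)
```

The point of the sketch is that a single decoder pass produces one shared high-resolution feature map, and only the two small heads are task-specific, which is where the reduction in computational budget claimed in the abstract would come from.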

Title: CAENet: Efficient Multi-task Learning for Joint Semantic Segmentation and Depth Estimation
Authors: Luxi Wang, Yingming Li
Publisher: Springer Nature Switzerland
Book: Machine Learning and Knowledge Discovery in Databases: Research Track
Print ISBN: 978-3-031-43423-5

Electronic ISBN: 978-3-031-43424-2

Copyright Year: 2023
DOI: https://doi.org/10.1007/978-3-031-43424-2_25
