Published in: International Journal of Computer Vision 4/2024

06.11.2023

Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose

Authors: Yaokun Li, Guang Tan, Chao Gou


Abstract

Landmark detection under large pose with occlusion has long been one of the challenging problems in facial analysis. Recently, many works have predicted pose or occlusion jointly within the multi-task learning (MTL) paradigm, trying to tap into the tasks' dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., by accommodating the explicit effects each task induces on the others. This is one of the essential factors that limits their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmarks, occlusion probability, and pose. The proposed CIT, besides implicitly mining task dependencies in a shared encoder, innovatively employs a cost-effective and portability-friendly strategy that passes the decoders' predictions to one another as prior knowledge, exploiting the coupling-induced effects in a human-like manner. Moreover, to the best of our knowledge, no existing dataset contains annotations for all these tasks simultaneously, so we introduce a new dataset, termed MERL-RAV-FLOP, based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results. The code and dataset can be accessed at https://github.com/Iron-LYK/CIT.
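The control flow the abstract describes can be sketched in plain Python. This is a hypothetical illustration, not the authors' implementation: all function names, the dummy arithmetic, and the scalar stand-ins for occlusion and pose are assumptions made only to show the cascaded iterative pattern, in which a shared encoder produces features and each task decoder receives the other decoders' previous predictions as explicit priors on every iteration.

```python
# Hypothetical sketch of the cascaded iterative, predictions-as-priors idea.
# Real decoders would be transformer modules over feature maps; here each is
# a toy function over a small feature vector so the loop structure is clear.

def shared_encoder(image):
    # Stand-in for a shared transformer encoder: a dummy 4-d feature vector.
    return [sum(image) / len(image)] * 4

def landmark_decoder(feats, prior_occ, prior_pose):
    # Refine landmark values using occlusion/pose priors (dummy arithmetic).
    return [f + 0.1 * prior_pose - 0.05 * prior_occ for f in feats]

def occlusion_decoder(feats, prior_lmk):
    # Predict a scalar occlusion probability, conditioned on landmark prior.
    return min(1.0, max(0.0, 0.5 + 0.01 * sum(prior_lmk)))

def pose_decoder(feats, prior_lmk):
    # Predict a scalar pose value (e.g., yaw), conditioned on landmark prior.
    return 0.2 * sum(prior_lmk)

def cascaded_iterative_predict(image, num_iters=3):
    feats = shared_encoder(image)
    lmk, occ, pose = feats, 0.5, 0.0  # initial guesses
    for _ in range(num_iters):
        # Each decoder consumes the other tasks' previous outputs as priors,
        # so the explicit inter-task effects are refined over iterations.
        lmk = landmark_decoder(feats, occ, pose)
        occ = occlusion_decoder(feats, lmk)
        pose = pose_decoder(feats, lmk)
    return lmk, occ, pose

lmk, occ, pose = cascaded_iterative_predict([0.2, 0.4, 0.6, 0.8])
print(len(lmk), 0.0 <= occ <= 1.0)
```

The key design point is that the priors are passed as plain predictions between decoders rather than as shared hidden features, which is what makes the strategy cheap and portable across backbones.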


Footnotes
1
FLOP: Facial Landmark, Occlusion and Pose.
 
Metadata
Title
Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose
Authors
Yaokun Li
Guang Tan
Chao Gou
Publication date
06.11.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 4/2024
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01935-2
