Published in: International Journal of Computer Vision, Issue 2/2023

Published: 11 November 2022

Robots Understanding Contextual Information in Human-Centered Environments Using Weakly Supervised Mask Data Distillation

Authors: Daniel Dworakowski, Angus Fung, Goldie Nejat


Abstract

Contextual information contained within human environments, such as text on signs, symbols, and objects, provides important information for robots to use for exploration and navigation. To identify and segment contextual information from images obtained in these environments, data-driven methods such as Convolutional Neural Networks (CNNs) can be used. However, these methods require significant amounts of human-labeled data, which are time-consuming to obtain. In this paper, we present the novel Weakly Supervised Mask Data Distillation (WeSuperMaDD) architecture for autonomously generating pseudo segmentation labels (PSLs) using CNNs not specifically trained for the task of text segmentation, e.g., CNNs trained instead for object classification or image captioning. WeSuperMaDD is uniquely able to generate PSLs using learned image features from datasets that are sparse and have limited diversity, as is common in robot navigation tasks in human-centered environments (e.g., malls, stores). Our proposed architecture uses a new mask refinement system which automatically searches for the PSL with the fewest foreground pixels that satisfies cost constraints, removing the need for handcrafted heuristic rules. Extensive experiments were conducted to validate the performance of WeSuperMaDD in generating PSLs for datasets containing text of various scales, fonts, orientations, curvatures, and perspectives in several indoor/outdoor environments. A detailed comparison study with existing approaches found a significant improvement in PSL quality. Furthermore, an instance segmentation CNN trained using the WeSuperMaDD architecture achieved measurable improvements in accuracy compared to an instance segmentation CNN trained with Naïve PSLs. Our method also achieved performance comparable to existing text detection methods.
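The abstract describes the mask refinement step only at a high level; the minimal Python sketch below illustrates the general idea of searching over candidate masks for the one with the fewest foreground pixels that still satisfies a cost constraint. It is an assumption-laden illustration, not the authors' implementation: the candidate generation (saliency thresholding), the cost function, and all names (`refine_mask`, `saliency`, `max_cost`) are hypothetical stand-ins for the mechanisms defined in the full paper.

```python
# Sketch of a mask refinement search: among candidate pseudo segmentation
# labels (PSLs) obtained by thresholding a saliency map at increasing levels,
# keep the sparsest mask whose cost stays within a budget. Hypothetical,
# assumption-based example only.
import numpy as np

def refine_mask(saliency: np.ndarray, max_cost: float):
    """Return the feasible candidate mask with the fewest foreground pixels."""
    best_mask, best_fg = None, float("inf")
    for t in np.linspace(0.1, 0.9, 9):      # candidate thresholds over the map
        mask = saliency >= t                # candidate PSL at this threshold
        fg = int(mask.sum())                # foreground pixel count
        if fg == 0:
            continue                        # an empty mask carries no label
        # Illustrative cost: fraction of total saliency mass discarded.
        cost = float(saliency[~mask].sum() / saliency.sum())
        if cost <= max_cost and fg < best_fg:
            best_mask, best_fg = mask, fg   # sparser feasible mask found
    return best_mask

# Usage with a synthetic saliency map standing in for a CNN's attention output.
rng = np.random.default_rng(0)
saliency = rng.random((64, 64))
psl = refine_mask(saliency, max_cost=0.8)
```

Higher thresholds yield sparser masks but discard more saliency mass, so minimizing foreground pixels subject to a cost bound trades label tightness against coverage without any handcrafted per-dataset rules.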


Metadata
Title
Robots Understanding Contextual Information in Human-Centered Environments Using Weakly Supervised Mask Data Distillation
Authors
Daniel Dworakowski
Angus Fung
Goldie Nejat
Publication date
11 November 2022
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 2/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-022-01706-5
