2021 | OriginalPaper | Chapter

Skeleton-Based Methods for Speaker Action Classification on Lecture Videos

Authors: Fei Xu, Kenny Davila, Srirangaraj Setlur, Venu Govindaraju

Published in: Pattern Recognition. ICPR International Workshops and Challenges

Publisher: Springer International Publishing


Abstract

The volume of online lecture videos is growing at a frenetic pace. This has led to an increased focus on methods for automated lecture video analysis to make these resources more accessible. These methods consider multiple information channels, including the actions of the lecture speaker. In this work, we analyze two methods that use spatio-temporal features of the speaker skeleton for action classification in lecture videos. The first is the AM Pose model, which is based on Random Forests with motion-based features. The second is a state-of-the-art action classifier based on a two-stream adaptive graph convolutional network (2S-AGCN) that uses features of both the joints and bones of the speaker skeleton. Each video is divided into fixed-length temporal segments. Then, the speaker skeleton is estimated on every frame in order to build a representation of each segment for further classification. Our experiments used the AccessMath dataset and a novel extension which will be publicly released. We compared four state-of-the-art pose estimators: OpenPose, Deep High Resolution, AlphaPose and Detectron2. We found that AlphaPose is the most robust to the encoding noise found in online videos. We also observed that 2S-AGCN outperforms the AM Pose model when the right domain adaptations are applied.
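
The following is a minimal sketch of the segment-level skeleton representation outlined in the abstract: the video is split into fixed-length temporal segments and a 2D skeleton is estimated on every frame, with a second "bone" stream derived from joint differences as in 2S-AGCN. The function estimate_pose is a placeholder for any of the compared pose estimators, and the segment length, joint count, and parent map are illustrative assumptions rather than the settings reported in the chapter.

import numpy as np
import cv2

SEGMENT_LEN = 64  # fixed segment length in frames (illustrative value)

# Simplified parent map for a COCO-style 17-joint skeleton (illustrative only).
PARENTS = [0, 0, 0, 1, 2, 0, 0, 5, 6, 7, 8, 5, 6, 11, 12, 13, 14]

def estimate_pose(frame):
    """Placeholder for a per-frame 2D pose estimator (e.g., OpenPose or AlphaPose).

    Assumed to return a (17, 2) array of (x, y) joint coordinates for the speaker.
    """
    raise NotImplementedError

def video_to_segments(path, segment_len=SEGMENT_LEN):
    """Split a lecture video into fixed-length segments of per-frame skeletons."""
    cap = cv2.VideoCapture(path)
    joints, segments = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        joints.append(estimate_pose(frame))       # (17, 2) keypoints for this frame
        if len(joints) == segment_len:
            segments.append(np.stack(joints))     # (T, 17, 2) joint stream
            joints = []
    cap.release()
    return segments

def bone_stream(segment):
    """Second input stream: vector from each joint's parent to the joint (bones)."""
    return segment - segment[:, PARENTS, :]       # (T, 17, 2) bone vectors

A downstream classifier such as the two-stream 2S-AGCN would consume the joint tensors and the corresponding bone tensors per segment; the actual preprocessing, joint set, and segment length used in the chapter may differ from this sketch.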

Literature
1. Xu, F., Davila, K., Setlur, S., Govindaraju, V.: Content extraction from lecture video via speaker action classification based on pose information. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1047–1054. IEEE (2019)
2. Davila, K., Agarwal, A., Gaborski, R., Zanibbi, R., Ludi, S.: AccessMath: indexing and retrieving video segments containing math expressions based on visual similarity. In: 2013 Western New York Image Processing Workshop (WNYIPW), pp. 14–17. IEEE (2013)
3. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: 2019 Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12018–12027. IEEE/CVF (2019)
4. Cao, Z., Hidalgo, G., Simon, T., Wei, S., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008 (2018)
5. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: 2019 Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5693–5703. IEEE/CVF (2019)
7. Fang, H.-S., Xie, S., Tai, Y.-W., Lu, C.: RMPE: regional multi-person pose estimation. In: 2017 International Conference on Computer Vision (ICCV), pp. 2353–2362. IEEE/CVF (2017)
8. Chen, Y., Tian, Y., He, M.: Monocular human pose estimation: a survey of deep learning-based methods. Comput. Vis. Image Underst., 102897 (2020)
9.
10. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 International Conference on Computer Vision (ICCV), pp. 2961–2969. IEEE/CVF (2017)
11. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
13. Nguyen, T.V., Song, Z., Yan, S.: STAP: spatial-temporal attention-aware pooling for action recognition. IEEE Trans. Circuits Syst. Video Technol. 25(1), 77–86 (2014)
14. Yi, Y., Zheng, Z., Lin, M.: Realistic action recognition with salient foreground trajectories. Expert Syst. Appl. 75, 44–55 (2017)
15. Wang, P., Li, W., Li, C., Hou, Y.: Action recognition based on joint trajectory maps with convolutional neural networks. Knowl.-Based Syst. 158, 43–53 (2018)
16. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (AAAI) (2018)
17. Ma, D., Xie, B., Agam, G.: A machine learning based lecture video segmentation and indexing algorithm. In: Document Recognition and Retrieval XXI, vol. 9021, p. 90210V. International Society for Optics and Photonics (2014)
18. Davila, K., Zanibbi, R.: Whiteboard video summarization via spatio-temporal conflict minimization. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 355–362. IEEE (2017)
19. Davila, K., Zanibbi, R.: Visual search engine for handwritten and typeset math in lecture videos and LaTeX notes. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 50–55. IEEE (2018)
20. Kota, B.U., Davila, K., Stone, A., Setlur, S., Govindaraju, V.: Generalized framework for summarization of fixed-camera lecture videos by detecting and binarizing handwritten content. Int. J. Doc. Anal. Recogn. (IJDAR) 22(3), 221–233 (2019)
21. Soares, E.R., Barrére, E.: An optimization model for temporal video lecture segmentation using word2vec and acoustic features. In: 25th Brazilian Symposium on Multimedia and the Web, pp. 513–520 (2019)
22. Shah, R.R., Yu, Y., Shaikh, A.D., Zimmermann, R.: TRACE: linguistic-based approach for automatic lecture video segmentation leveraging Wikipedia texts. In: 2015 IEEE International Symposium on Multimedia (ISM), pp. 217–220. IEEE (2015)
23. Yang, H., Meinel, C.: Content based lecture video retrieval using speech and video text information. IEEE Trans. Learn. Technol. 7(2), 142–154 (2014)
24. Radha, N.: Video retrieval using speech and text in video. In: 2016 International Conference on Inventive Computation Technologies (ICICT), vol. 2, pp. 1–6. IEEE (2016)
Metadata
Title
Skeleton-Based Methods for Speaker Action Classification on Lecture Videos
Authors
Fei Xu
Kenny Davila
Srirangaraj Setlur
Venu Govindaraju
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-68799-1_18
