Published in: World Wide Web 3/2019

03.05.2018

Multimodal deep representation learning for video classification

Authors: Haiman Tian, Yudong Tao, Samira Pouyanfar, Shu-Ching Chen, Mei-Ling Shyu

Abstract

Real-world applications usually encounter data of various modalities, each carrying valuable information. To enhance these applications, it is essential to analyze all of the information extracted from the different modalities, yet most existing learning models focus on a single modality and ignore the rest. This paper presents a new multimodal deep learning framework for event detection from videos that leverages recent advances in deep neural networks. First, several deep learning models are used to extract useful information from multiple modalities: pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction, and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates the different data representations at two levels, namely the frame level and the video level. Unlike existing multimodal learning algorithms, the proposed framework can reason about a missing data type from the other available modalities. The framework is applied to a new video dataset containing natural disaster classes. The experimental results demonstrate its effectiveness compared to several single-modality deep learning models as well as conventional fusion techniques. Specifically, the final accuracy is improved by more than 16% and 7% over the best single-modality and fusion models, respectively.
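The two-level fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only assumes that per-frame visual and audio features are fused by concatenation at the frame level, that frame-level class probabilities are averaged over time, and that a video-level textual prediction is blended in by late fusion, with a simple fallback when the textual modality is missing. All function names and the weight `w` are hypothetical.

```python
import numpy as np

def frame_level_fusion(visual, audio):
    """Concatenate per-frame visual (n_frames, d_v) and audio
    (n_frames, d_a) feature vectors into one representation."""
    return np.concatenate([visual, audio], axis=1)

def video_level_fusion(frame_probs, text_probs, w=0.5):
    """Average frame-level class probabilities over time, then
    blend with a video-level textual prediction (late fusion).
    If the textual modality is missing, fall back to the
    visual/audio prediction alone (hypothetical handling)."""
    video_probs = frame_probs.mean(axis=0)
    if text_probs is None:
        return video_probs
    return w * video_probs + (1 - w) * text_probs

# Toy example: 4 frames, 3-dim visual + 2-dim audio features
fused = frame_level_fusion(np.ones((4, 3)), np.zeros((4, 2)))
print(fused.shape)  # (4, 5)

# Two frames, two classes; blend with a textual prediction
frame_probs = np.array([[0.6, 0.4], [0.2, 0.8]])
print(video_level_fusion(frame_probs, np.array([0.5, 0.5])))
print(video_level_fusion(frame_probs, None))
```

The fallback branch mirrors the paper's claim that a missing modality can be compensated for by the remaining ones; here the compensation is simply dropping the missing term from the weighted sum.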

Metadata
Title
Multimodal deep representation learning for video classification
Authors
Haiman Tian
Yudong Tao
Samira Pouyanfar
Shu-Ching Chen
Mei-Ling Shyu
Publication date
03.05.2018
Publisher
Springer US
Published in
World Wide Web / Issue 3/2019
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-018-0548-3
