
2018 | Original Paper | Book Chapter

Learning to Separate Object Sounds by Watching Unlabeled Video

Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing


Abstract

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/.
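The separation step the abstract outlines lends itself to a short sketch. Below is a minimal, self-contained illustration of basis-guided NMF source separation: given per-object spectral bases (which the paper's MIML framework recovers from unlabeled video; here they are mocked with random matrices named W_violin and W_dog), the mixture spectrogram is factored with the bases held fixed, and Wiener-style soft masks split the mixture into per-object sources. All names, shapes, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
# A sketch of NMF separation guided by per-object spectral bases.
# Assumed/illustrative: W_violin and W_dog stand in for bases that
# the paper's MIML step would disentangle from unlabeled video.
import numpy as np

rng = np.random.default_rng(0)

def fit_activations(V, W, n_iter=200, eps=1e-9):
    """Fit V ~= W @ H for nonnegative H with W held fixed, using the
    multiplicative update for the KL-divergence NMF objective."""
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return H

F, T, K = 257, 100, 8            # freq bins, frames, bases per object
W_violin = rng.random((F, K))    # hypothetical disentangled bases, object 1
W_dog    = rng.random((F, K))    # hypothetical disentangled bases, object 2

# Synthetic mixture magnitude spectrogram, just for this demo.
V = W_violin @ rng.random((K, T)) + W_dog @ rng.random((K, T))

# Hold the concatenated object bases fixed; estimate activations only.
W = np.hstack([W_violin, W_dog])
H = fit_activations(V, W)

# Reconstruct each object's spectrogram and build soft (Wiener-style) masks.
V_violin = W_violin @ H[:K]
V_dog    = W_dog    @ H[K:]
mask_violin = V_violin / (V_violin + V_dog + 1e-9)

# Applying mask_violin to the complex mixture STFT and inverting the
# transform would yield the separated violin waveform.
print("violin mask mean:", float(mask_violin.mean()))
```

Consistent with the abstract, the bases assembled for a novel video would be those of the objects detected visually in its frames; masking the mixture spectrogram and inverting is the standard route from the factorization back to per-object waveforms.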


Footnotes
1. Our task can hence be seen as “weakly supervised”, though the weak “labels” themselves are inferred from the video, not manually annotated.

2. AudioSet offers noisy video-level audio class annotations. However, we do not use any of its label information.
Metadata
Title
Learning to Separate Object Sounds by Watching Unlabeled Video
Authors
Ruohan Gao
Rogerio Feris
Kristen Grauman
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01219-9_3