
2018 | Original Paper | Book Chapter

Learning to Separate Object Sounds by Watching Unlabeled Video

Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing


Abstract

Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/.
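The separation step the abstract outlines lends itself to a short sketch. Below is a minimal, self-contained illustration of basis-guided NMF source separation: given per-object spectral bases (which the paper's MIML framework recovers from unlabeled video; here they are mocked with random matrices named W_violin and W_dog), the mixture spectrogram is factored with the bases held fixed, and Wiener-style soft masks split the mixture into per-object sources. All names, shapes, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
# A sketch of NMF separation guided by per-object spectral bases.
# Assumed/illustrative: W_violin and W_dog stand in for bases that
# the paper's MIML step would disentangle from unlabeled video.
import numpy as np

rng = np.random.default_rng(0)

def fit_activations(V, W, n_iter=200, eps=1e-9):
    """Fit V ~= W @ H for nonnegative H with W held fixed, using the
    multiplicative update for the KL-divergence NMF objective."""
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return H

F, T, K = 257, 100, 8            # freq bins, frames, bases per object
W_violin = rng.random((F, K))    # hypothetical disentangled bases, object 1
W_dog    = rng.random((F, K))    # hypothetical disentangled bases, object 2

# Synthetic mixture magnitude spectrogram, just for this demo.
V = W_violin @ rng.random((K, T)) + W_dog @ rng.random((K, T))

# Hold the concatenated object bases fixed; estimate activations only.
W = np.hstack([W_violin, W_dog])
H = fit_activations(V, W)

# Reconstruct each object's spectrogram and build soft (Wiener-style) masks.
V_violin = W_violin @ H[:K]
V_dog    = W_dog    @ H[K:]
mask_violin = V_violin / (V_violin + V_dog + 1e-9)

# Applying mask_violin to the complex mixture STFT and inverting the
# transform would yield the separated violin waveform.
print("violin mask mean:", float(mask_violin.mean()))
```

Consistent with the abstract, the bases assembled for a novel video would be those of the objects detected visually in its frames; masking the mixture spectrogram and inverting is the standard route from the factorization back to per-object waveforms.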


Footnotes
1. Our task can hence be seen as “weakly supervised”, though the weak “labels” themselves are inferred from the video, not manually annotated.

2. AudioSet offers noisy video-level audio class annotations. However, we do not use any of its label information.
Metadata
Title
Learning to Separate Object Sounds by Watching Unlabeled Video
Authors
Ruohan Gao
Rogerio Feris
Kristen Grauman
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01219-9_3