2018 | OriginalPaper | Chapter

14. Audio-Visual Source Separation with Alternating Diffusion Maps

Authors: David Dov, Ronen Talmon, Israel Cohen

Published in: Audio Source Separation

Publisher: Springer International Publishing

Abstract

In this chapter we consider the separation of multiple sound sources of different types, including multiple speakers and transients, measured by a single microphone and a video camera. We address the problem of separating a particular sound source from all the others, focusing on obtaining an underlying representation of that source while attenuating the rest. When the video camera is pointed only at the desired sound source, the problem becomes equivalent to extracting the source common to the audio and video modalities while ignoring the other sources. We use a kernel-based method designed specifically for this task, which provides an underlying representation of the common source. We demonstrate the usefulness of the obtained representation for detecting the activity of the common source and discuss how it may be further used for source separation.
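
The kernel-based idea named in the chapter title, alternating diffusion, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian kernel, the median scale heuristic, and the use of a plain eigendecomposition are all assumptions made for illustration. Each modality gets a row-stochastic affinity kernel over the same time frames, the two kernels are multiplied, and the leading non-trivial eigenvectors of the product serve as a representation that tends to emphasize what the two modalities share.

```python
import numpy as np


def affinity_kernel(features, scale=None):
    """Row-stochastic Gaussian kernel over frames (hypothetical helper).

    features: array of shape (n_frames, n_dims), one row per time frame.
    scale: kernel bandwidth; defaults to the median squared distance,
           which is a common heuristic and an assumption here.
    """
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip small negative round-off
    if scale is None:
        scale = np.median(sq_dists)
    kernel = np.exp(-sq_dists / scale)
    # Normalize each row to sum to one (a diffusion/Markov matrix).
    return kernel / kernel.sum(axis=1, keepdims=True)


def alternating_diffusion_embedding(audio_features, video_features, n_components=2):
    """Embed frames using the product of the two per-modality kernels.

    Returns (embedding, eigenvalues). The trivial constant eigenvector
    (eigenvalue 1) is skipped. The product is not symmetric, so the
    eigendecomposition may be complex; keeping only the real part is a
    simplification made in this sketch.
    """
    k_audio = affinity_kernel(audio_features)
    k_video = affinity_kernel(video_features)
    k_alternating = k_video @ k_audio  # one audio step followed by one video step
    eigvals, eigvecs = np.linalg.eig(k_alternating)
    order = np.argsort(-np.abs(eigvals))
    keep = order[1:n_components + 1]
    return np.real(eigvecs[:, keep]), np.real(eigvals[keep])


if __name__ == "__main__":
    # Synthetic stand-in data: one variable common to both modalities,
    # plus an independent nuisance variable in each.
    rng = np.random.default_rng(0)
    n_frames = 60
    common = rng.uniform(size=n_frames)
    audio = np.column_stack([common, rng.uniform(size=n_frames)])
    video = np.column_stack([common, rng.uniform(size=n_frames)])
    embedding, values = alternating_diffusion_embedding(audio, video)
    print(embedding.shape, values)
```

In a real audio-visual setting, the rows of `audio_features` and `video_features` would be per-frame descriptors aligned in time; the feature choice and kernel scale would need to follow the chapter itself.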

Metadata
Title
Audio-Visual Source Separation with Alternating Diffusion Maps
Authors
David Dov
Ronen Talmon
Israel Cohen
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-73031-8_14