Text-Informed Audio Source Separation. Example-Based Approach Using Non-Negative Matrix Partial Co-Factorization

Journal of Signal Processing Systems

Abstract

Informed audio source separation, in which auxiliary information guides the separation process, has recently attracted considerable research interest because classical blind (non-informed) approaches often fail to deliver satisfactory performance in practical applications. In this paper we present a novel text-informed framework in which a target speech source is separated from the background of a mixture using the corresponding text. First, given the text, we produce a speech example, either with a speech synthesizer or by having a human read it. We then use this example to guide the separation. For that purpose, we introduce a new variant of the non-negative matrix partial co-factorization (NMPCF) model based on an excitation-filter-channel (EFC) speech model. This model allows the linguistic information to be shared between the speech example and the speech in the mixture. We derive the corresponding multiplicative update (MU) rules for parameter estimation, and we propose and investigate several extensions of the model. Extensive experiments assess the effectiveness of the proposed approach in terms of source separation and alignment performance.
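To make the co-factorization idea concrete, the following is a minimal sketch and not the model of the paper. It assumes a much simpler coupling than the excitation-filter-channel structure: the mixture and the speech example share only one speech spectral basis, while the background gets its own basis. The function and variable names (`nmpcf_is`, `W_s`, `W_b`, `H_s`, `H_b`, `H_e`) are hypothetical, the cost is the Itakura-Saito divergence, and the square-root exponent in the updates is the majorization-minimization variant of the multiplicative rules, chosen here because it is known to keep the cost from increasing.

```python
import numpy as np


def is_divergence(V, V_hat):
    """Itakura-Saito divergence D(V | V_hat), summed over all entries."""
    ratio = V / V_hat
    return float(np.sum(ratio - np.log(ratio) - 1.0))


def nmpcf_is(V_mix, V_ex, n_speech=8, n_bg=4, n_iter=50, seed=0):
    """Toy NMPCF: the speech basis W_s is shared by mixture and example.

    V_mix ~ W_s @ H_s + W_b @ H_b   (mixture power spectrogram)
    V_ex  ~ W_s @ H_e               (speech-example power spectrogram)
    Inputs must be strictly positive. All names are hypothetical.
    """
    rng = np.random.default_rng(seed)
    F, N = V_mix.shape
    M = V_ex.shape[1]
    W_s = rng.random((F, n_speech)) + 0.1
    W_b = rng.random((F, n_bg)) + 0.1
    H_s = rng.random((n_speech, N)) + 0.1
    H_b = rng.random((n_bg, N)) + 0.1
    H_e = rng.random((n_speech, M)) + 0.1

    def models():
        # Current approximations of the mixture and of the example.
        return W_s @ H_s + W_b @ H_b, W_s @ H_e

    def cost():
        Vm, Ve = models()
        return is_divergence(V_mix, Vm) + is_divergence(V_ex, Ve)

    costs = [cost()]
    for _ in range(n_iter):
        # Shared basis: numerator and denominator sum over both data sets.
        Vm, Ve = models()
        num = (V_mix / Vm**2) @ H_s.T + (V_ex / Ve**2) @ H_e.T
        den = (1.0 / Vm) @ H_s.T + (1.0 / Ve) @ H_e.T
        W_s *= np.sqrt(num / den)

        # Background basis: only the mixture constrains it.
        Vm, _ = models()
        W_b *= np.sqrt(((V_mix / Vm**2) @ H_b.T) / ((1.0 / Vm) @ H_b.T))

        # Activations of speech and background in the mixture.
        Vm, _ = models()
        H_s *= np.sqrt((W_s.T @ (V_mix / Vm**2)) / (W_s.T @ (1.0 / Vm)))
        Vm, _ = models()
        H_b *= np.sqrt((W_b.T @ (V_mix / Vm**2)) / (W_b.T @ (1.0 / Vm)))

        # Activations of the speech example.
        _, Ve = models()
        H_e *= np.sqrt((W_s.T @ (V_ex / Ve**2)) / (W_s.T @ (1.0 / Ve)))
        costs.append(cost())

    speech = W_s @ H_s
    mask = speech / (speech + W_b @ H_b)  # Wiener-like soft mask for speech
    return {"mask": mask, "costs": costs, "W_s": W_s}


# Synthetic stand-ins for real power spectrograms (random, strictly positive).
rng = np.random.default_rng(1)
V_mix = rng.random((20, 30)) + 0.1
V_ex = rng.random((20, 25)) + 0.1
result = nmpcf_is(V_mix, V_ex, n_iter=30)
```

In a real pipeline the soft mask would be applied to the complex short-time Fourier transform of the mixture before inversion. Sharing only the basis is a deliberate simplification: it shows how the example constrains the speech part of the mixture without reproducing the way the paper's model shares linguistic information.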

Notes

  1. The NMPCF model [17] is a particular case of the more general generalized coupled tensor factorization (GCTF) model, which has also been used for informed source separation [5].

  2. The proposed EFC model is a new extension of the excitation-filter model [19].

  3. Keep in mind that \(\mathbf{W}^{\phi} = \mathbf{P}\mathbf{E}^{\phi}\), \(\mathbf{w}^{c}_{\scriptscriptstyle {S}} = \mathbf{P}\mathbf{e}^{c}_{\scriptscriptstyle {S}}\) and \(\mathbf{w}^{c}_{\scriptscriptstyle {Y}} = \mathbf{P}\mathbf{e}^{c}_{\scriptscriptstyle {Y}}\).

  4. When applied to power spectrograms of audio signals, IS divergence was shown as one of the most suitable choices for NMF-like decompositions [23], in particular thanks to its scale invariance property.

  5. We used "ivona" synthesizers (www.ivona.com/en/) to create speech examples.

  6. We implemented these approaches with the help of the FASST toolbox [20].
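
The scale invariance property mentioned in Note 4 can be checked numerically. The sketch below is illustrative only: the vectors and the scale factor are arbitrary, and the helper names are hypothetical. It compares the Itakura-Saito (IS) divergence with the squared Euclidean distance when both arguments are multiplied by the same constant.

```python
import numpy as np


def is_divergence(x, y):
    """Itakura-Saito divergence: sum of x/y - log(x/y) - 1."""
    ratio = np.asarray(x, dtype=float) / np.asarray(y, dtype=float)
    return float(np.sum(ratio - np.log(ratio) - 1.0))


def squared_euclidean(x, y):
    """Squared Euclidean distance, shown for comparison."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(diff**2))


# Arbitrary positive "power spectrum" values and an arbitrary scale factor.
x = np.array([1.0, 4.0, 0.5])
y = np.array([2.0, 3.0, 0.25])
scale = 1000.0

d_is = is_divergence(x, y)
d_is_scaled = is_divergence(scale * x, scale * y)      # same value as d_is
d_eu = squared_euclidean(x, y)
d_eu_scaled = squared_euclidean(scale * x, scale * y)  # scale**2 times d_eu
```

Because the IS divergence depends only on the ratio of its arguments, low-energy time-frequency bins contribute to the fit on the same footing as high-energy ones, which suits the large dynamic range of audio power spectrograms.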

References

  1. Le Magoarou, L., Ozerov, A., Duong, N.Q.K. (2013). Text-informed audio source separation using nonnegative matrix partial co-factorization. In 2013 IEEE International workshop on machine learning for signal processing (MLSP) (pp. 1–6).

  2. Vincent, E., Araki, S., Theis, F., Nolte, G., Bofill, P., Sawada, H., Ozerov, A., Gowreesunker, V., Lutter, D., Duong, N.Q.K. (2012). The signal separation evaluation campaign (2007-2010): Achievements and remaining challenges. Signal Processing, 92(8), 1928–1936.

  3. Ganseman, J., Mysore, G.J., Abel, J.S., Scheunders, P. (2010). Source separation by score synthesis. In Proceedings of the international computer music conference (ICMC) (pp. 462–465). New York: NY.

  4. Hennequin, R., David, B., Badeau, R. (2011). Score informed audio source separation using a parametric model of non-negative spectrogram. In Proceedings of the IEEE International Conference on Acoustics, speech, and signal processing (ICASSP) (pp. 45–48). Czech Republic: Prague.

  5. Simsekli, U., & Cemgil, A.T. (2012). Score guided musical source separation using generalized coupled tensor factorization. In Proceedings of the 20th European signal processing conference (EUSIPCO) (pp. 2639–2643).

  6. Fritsch, J., & Plumbley, M.D. (2013). Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP) (pp. 888–891).

  7. Smaragdis, P., & Mysore, G.J. (2009). Separation by "humming": User-guided sound extraction from monophonic mixtures. In Proceedings of the IEEE workshop applications of signal processing to audio and acoustics (WASPAA) (pp. 69–72).

  8. FitzGerald, D. (2011). User assisted source separation using nonnegative matrix factorisation. In 22nd IET Irish signals and systems conference. Dublin.

  9. Durrieu, J.L., & Thiran, J.P. (2012). Musical audio source separation based on user-selected F0 track. In Proceedings of the international conference on latent variable analysis and signal separation (LVA/ICA) (pp. 438–445). Israel: Tel-Aviv.

  10. Ozerov, A., Févotte, C., Blouet, R., Durrieu, J.-L. (2011). Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP) (pp. 257–260). Czech Republic: Prague.

  11. Lefèvre, A., Bach, F., Févotte, C. (2012). Semi-supervised NMF with time-frequency annotations for single-channel source separation. In Proceedings of the international symposium on music information retrieval (ISMIR) (pp. 115–120). Portugal: Porto.

  12. Bryan, N.J., & Mysore, G.J. (2013). Interactive user-feedback for sound source separation. In International conference on intelligent user interfaces (IUI). Santa Monica.

  13. Duong, N.Q.K., Ozerov, A., Chevallier, L., Sirot, J. (2014). An interactive audio source separation framework based on nonnegative matrix factorization. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP). Italy: Florence.

  14. Roweis, S.T. (2000). One microphone source separation. In Advances in Neural Information Processing Systems 13 (pp. 793–799): MIT Press.

  15. Wang, W., Cosker, D., Hicks, Y., Sanei, S., Chambers, J.A. (2005). Video assisted speech source separation. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (pp. 425–428). Philadelphia: USA.

  16. Mysore, G.J., & Smaragdis, P. (2012). A non-negative approach to language informed speech separation. In Proceedings of the international conference on latent variable analysis and signal separation (LVA / ICA) (pp. 356–363). Israel: Tel-Aviv.

  17. Kim, M., Yoo, J., Kang, K., Choi, S. (2011). Nonnegative matrix partial co-factorization for spectral and temporal drum source separation. IEEE Journal of Selected Topics in Signal Processing, 5(6), 1192–1204.

  18. Virtanen, T., & Klapuri, A. (2006). Analysis of polyphonic audio using source-filter model and non-negative matrix factorization. In Advances in models for acoustic processing, neural information processing systems workshop.

  19. Durrieu, J.L., Richard, G., David, B., Févotte, C. (2010). Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Transactions on Audio, Speech and Language Processing, 18(3), 564–575.

  20. Ozerov, A., Vincent, E., Bimbot, F. (2012). A general flexible framework for the handling of prior information in audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 20(4), 1118–1133.

  21. Duong, N.Q.K., Vincent, E., Gribonval, R. (2010). Underdetermined reverberant audio source separation using a full-rank spatial covariance model. IEEE Transactions on Audio, Speech and Language Processing, 18(7), 1830–1840.

  22. Ono, N., Koldovsky, Z., Miyabe, S., Ito, N. (2013). The 2013 signal separation evaluation campaign. In 2013 IEEE International workshop on machine learning for signal processing (MLSP), (pp. 1–6).

  23. Févotte, C., Bertin, N., Durrieu, J.-L. (2009). Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Computation, 21(3), 793–830.

  24. Pedone, A., Burred, J.J., Maller, S., Leveau, P. (2011). Phoneme-level text to audio synchronization on speech signals with background music. In Proceedings of the INTERSPEECH (pp. 433–436).

  25. Ellis, D. (2003). Dynamic time warp (DTW) in Matlab. Web resource. http://www.ee.columbia.edu/ln/labrosa/matlab/dtw/.

  26. Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N. (1993). DARPA TIMIT: Acoustic-phonetic continuous speech corpus. Technical report, NIST, distributed with the TIMIT CD-ROM.

  27. Vincent, E., Gribonval, R., Févotte, C. (2006). Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4), 1462–1469.

  28. Emiya, V., Vincent, E., Harlander, N., Hohmann, V. (2011). Subjective and objective quality assessment of audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 19(7), 2046–2057.

Acknowledgements

The authors would like to thank S. Ayalde, F. Lefebvre, A. Newson and N. Sabater for their help in producing the speech examples, as well as the anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Luc Le Magoarou.

Additional information

Most of this work was done while the first author was with Technicolor, and part of it was presented at the 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP) [1].

About this article

Cite this article

Le Magoarou, L., Ozerov, A. & Duong, N.Q.K. Text-Informed Audio Source Separation. Example-Based Approach Using Non-Negative Matrix Partial Co-Factorization. J Sign Process Syst 79, 117–131 (2015). https://doi.org/10.1007/s11265-014-0920-1

  • DOI: https://doi.org/10.1007/s11265-014-0920-1
