
Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation

Authors: Han-Gyu Kim, Gil-Jin Jang, Yung-Hwan Oh, Ho-Jin Choi

Published in: The Journal of Supercomputing | Issue 10/2020

Published: 21 February 2019


Abstract

In this paper, we propose speech/music pitch classification based on recurrent neural networks (RNNs) for monaural speech segregation from music interference. The speech segregation methods considered in this paper use sub-band masking, constructing segregation masks that are modulated by the estimated speech pitch. For speech mixed with music, however, speech pitch estimation becomes unreliable because speech and music have similar harmonic structures. To remove the music interference effectively, the proposed method models the temporal trajectories of speech and music pitch values and classifies an unknown continuous pitch sequence as belonging to either speech or music. Among the various types of RNNs, we chose the simple recurrent network, long short-term memory (LSTM), and bidirectional LSTM for pitch classification. The experimental results show that the proposed method significantly outperforms the baseline methods on speech–music mixtures without loss of segregation performance on speech–noise mixtures.
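To make the idea of classifying a continuous pitch sequence concrete, the following is a minimal sketch, not the authors' implementation. It splits a frame-wise pitch track into continuous trajectories and runs each one through an Elman-style simple recurrent network to obtain a speech probability. Everything specific here is an assumption for illustration: the names `split_continuous` and `SimpleRecurrentClassifier`, the convention that a pitch of 0 marks an unvoiced frame, the 20% jump threshold, the log-frequency input with a 100 Hz reference, and the hidden size. The weights are random and untrained, so the probabilities carry no meaning; only the data flow is shown.

```python
import math
import random


def split_continuous(pitch_hz, max_jump_ratio=0.2):
    """Split a frame-wise pitch track into continuous trajectories.

    Assumptions (not from the paper): a pitch value <= 0 marks an unvoiced
    frame, and a relative jump above max_jump_ratio starts a new trajectory.
    """
    segments, current = [], []
    for f0 in pitch_hz:
        if f0 <= 0:
            # Unvoiced frame: close the running trajectory, if any.
            if current:
                segments.append(current)
                current = []
            continue
        if current and abs(f0 - current[-1]) / current[-1] > max_jump_ratio:
            # Pitch jump: treat it as the start of a new trajectory.
            segments.append(current)
            current = []
        current.append(f0)
    if current:
        segments.append(current)
    return segments


class SimpleRecurrentClassifier:
    """Elman-style network: h_t = tanh(w_in * x_t + W_rec h_{t-1} + b)."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.n = hidden_size
        # Random, untrained weights (illustration only).
        self.w_in = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                      for _ in range(hidden_size)]
        self.b = [0.0] * hidden_size
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.b_out = 0.0

    def speech_probability(self, segment_hz):
        """Return a sigmoid output in (0, 1) for one pitch trajectory."""
        h = [0.0] * self.n
        for f0 in segment_hz:
            x = math.log2(f0 / 100.0)  # log-frequency input (assumed feature)
            h = [
                math.tanh(
                    self.w_in[i] * x
                    + sum(self.w_rec[i][j] * h[j] for j in range(self.n))
                    + self.b[i]
                )
                for i in range(self.n)
            ]
        # The final hidden state summarizes the whole trajectory.
        logit = sum(w * hi for w, hi in zip(self.w_out, h)) + self.b_out
        return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical pitch track in Hz, one value per frame (0 = unvoiced).
track = [0, 0, 120, 123, 125, 0, 0, 220, 220, 220, 220, 0]
segments = split_continuous(track)
model = SimpleRecurrentClassifier()
probabilities = [model.speech_probability(seg) for seg in segments]
```

In a trained system, trajectories judged to be music would be excluded before the pitch-modulated segregation mask is built. An LSTM or bidirectional LSTM would replace the recurrent update inside `speech_probability` while the surrounding flow stays the same.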



Title: Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation
Authors: Han-Gyu Kim, Gil-Jin Jang, Yung-Hwan Oh, Ho-Jin Choi
Publication date: 21 February 2019
Publisher: Springer US
Published in: The Journal of Supercomputing, Issue 10/2020
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-019-02785-x
