2. Acoustic Features and Modelling

Abstract

This chapter gives an overview of the methods for speech and music analysis implemented by the author in the openSMILE toolkit. The methods described include all the relevant processing steps from an audio signal to a classification result. These steps include pre-processing and segmentation of the input, feature extraction (i.e., computation of acoustic Low-level Descriptors (LLDs) and summarisation of these descriptors over high-level segments), and modelling (e.g., classification).
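As a self-contained illustration of this processing chain (framing, LLD extraction, summarisation by functionals, classification), a minimal NumPy sketch follows. The chosen descriptors (RMS energy, zero-crossing rate), the window settings, and the placeholder linear classifier are arbitrary examples and do not reproduce the openSMILE implementation.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Cut a mono signal into overlapping frames (short-time analysis)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def lld(frames):
    """Two example low-level descriptors per frame: RMS energy and zero-crossing rate."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([rms, zcr])

def functionals(llds):
    """Summarise each LLD contour over the segment by its mean and standard deviation."""
    return np.concatenate([llds.mean(axis=0), llds.std(axis=0)])

# toy usage: a random 1 s segment at 16 kHz and a placeholder linear classifier
x = np.random.randn(16000) * 0.1
features = functionals(lld(frame_signal(x)))
w, b = np.zeros_like(features), 0.0   # weights would come from training
label = int(np.dot(w, features) + b > 0)
```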

Footnotes
1
In openSMILE the FFT with complex valued output (and also the inverse FFT) is implemented by the cTransformFFT component. Magnitude and Phase can be computed with the cFFTmagphase component.
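As a plain NumPy illustration of these two steps (a generic FFT example, not the cTransformFFT or cFFTmagphase code itself):

```python
import numpy as np

frame = np.hamming(400) * np.random.randn(400)  # one windowed analysis frame
spectrum = np.fft.rfft(frame)                   # complex-valued FFT
magnitude = np.abs(spectrum)                    # magnitude spectrum
phase = np.angle(spectrum)                      # phase spectrum
```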
 
2
In openSMILE windowing of audio samples (i.e., short-time analysis) can be performed with the cFramer component.
 
4
In openSMILE pre-emphasis can be implemented with the cPreemphasis component on a continuous signal, or with the cVectorPreemphasis component on a frame basis (behaviour compatible with the Hidden Markov Toolkit (HTK) (Young et al. 2006)).
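A minimal sketch of the standard first-order pre-emphasis filter in both modes (whole signal and per frame); the coefficient k = 0.97 is only a common example value, and the handling of the first sample in cPreemphasis, cVectorPreemphasis, and HTK is not reproduced exactly here.

```python
import numpy as np

def preemphasis(x, k=0.97):
    """First-order pre-emphasis y[n] = x[n] - k * x[n-1] on a continuous signal."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= k * x[:-1]
    return y

def preemphasis_frames(frames, k=0.97):
    """Frame-wise variant: each frame is filtered independently of its neighbours."""
    return np.apply_along_axis(preemphasis, 1, frames, k)
```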
 
5
RMS and logarithmic energy can be computed in openSMILE with the cEnergy component.
 
6
openSMILE defines \(8.674676 \times 10^{-19}\) as a floor value for the argument of the log, for samples scaled to the range of \(-1\) to \(+1\). In case of a sample value range from \(-32767\) to \(+32767\) (HTK-compatible mode), the floor value for the argument of the log is 1.
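A sketch combining footnotes 5 and 6, assuming the common definitions (root of the mean of squares for RMS energy, natural logarithm of the floored sum of squares for log energy); the exact cEnergy formulation may differ.

```python
import numpy as np

def frame_energy(frame, htk_range=False):
    """Return RMS energy and floored log energy of one frame of samples."""
    frame = np.asarray(frame, dtype=float)
    rms = np.sqrt(np.mean(frame ** 2))
    floor = 1.0 if htk_range else 8.674676e-19   # floor values quoted in footnote 6
    log_e = np.log(max(np.sum(frame ** 2), floor))
    return rms, log_e
```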
 
7
The loudness approximation and the signal intensity as defined here can be extracted in openSMILE with the cIntensity component.
 
8
In openSMILE the option dBpsd must be enabled in the cFFTmagphase component in order to compute logarithmic power spectral densities.
 
9
In openSMILE these spectral scale transformations and spline interpolation can be applied with the cSpecScale component.
 
11
The SPEEX version of the Bark transformation is implemented in openSMILE as a forward transformation only. It therefore does not work with all components, as most components require a backward scale transformation.
 
12
For an implementation, see the cMelspec component in openSMILE and scale transformation functions in the smileUtil library.
 
13
Band spectra can be computed in openSMILE with the cMelspec component, which—despite the name Melspec—can compute general band spectra for all supported frequency scales from a linear magnitude or power spectrum.
 
14
In openSMILE the cMelspec component implements these filterbanks for various frequency scales (not only Mel).
 
15
In openSMILE the FIR filterbanks with Gabor, gammatone, high- and low-pass filters can be applied with the cFirFilterbank component.
 
16
In openSMILE these spectral descriptors can be extracted with the cSpectral component.
 
17
In openSMILE, this is implemented in the cSpectral component.
 
18
This is the current default in all openSMILE feature sets up to version 2.0. An option for normalisation might appear in later versions.
 
19
In the cSpectral component.
 
20
Enabled by the option normBandEnergies of the cSpectral component of openSMILE.
 
21
ACF according to this equation is implemented in openSMILE in the cAcf component.
 
22
In openSMILE linear predictive coding is supported via the cLpc component.
 
23
As implemented in openSMILE in the cLpc component.
 
24
In openSMILE the cLsp component implements LSP computation based on code from the Speex codec library (www.speex.org).
 
25
In openSMILE formant extraction is implemented via this method in the cFormant component, which processes the AR LP coefficients from the cLpc component.
 
26
PLP via this method is implemented in openSMILE in the cPlp component.
 
27
In openSMILE this Bark scale can be selected in the cMelspec component by setting the specScale option to ‘bark_schroed’.
 
28
openSMILE allows for this flexibility because the PLP procedure builds on a chain of components: cTransformFFT, cFFTmagphase, cMelspec (for the non-linear band spectrum), and cPlp (for equal-loudness weighting, the intensity power law, autoregressive modelling, and cepstral coefficients).
 
29
In openSMILE it is enabled by setting htkcompatible to 1 in the cPlp component.
 
30
Configurable via the option compression in the openSMILE component cPlp.
 
31
In openSMILE MFCC are computed via cMelspec (taking FFT magnitude spectrum from cFFTmagphase as input) and cMfcc.
 
32
In openSMILE the floor value is also \(10^{-8}\) by default, and 1 when htkcompatible=1 in cMfcc.
 
33
Please note that the DCT equation given in Young et al. (2006) and the one given here differ because Young et al. (2006) start the summation at \(b=1\) for the first Mel-spectrum band, while here the first band is set at \(b=0\).
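For clarity, and assuming the standard HTK-style DCT-II over \(B\) Mel bands with the \(\sqrt{2/B}\) normalisation of Young et al. (2006), the two index conventions describe the same transform:

\[
c_i = \sqrt{\tfrac{2}{B}} \sum_{b=1}^{B} m_b \cos\!\left(\tfrac{\pi i}{B}\left(b - \tfrac{1}{2}\right)\right)
\quad\text{or, with the first band indexed as } b=0,\quad
c_i = \sqrt{\tfrac{2}{B}} \sum_{b=0}^{B-1} m_b \cos\!\left(\tfrac{\pi i}{B}\left(b + \tfrac{1}{2}\right)\right).
\]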
 
34
PLP-CC can be computed in openSMILE by creating a chain of cFFTmagphase, cMelspec, and cPlp and setting the appropriate options for cepstral coefficients in the cPlp component.
 
35
In openSMILE this behaviour is implemented in the pitch smoother components and in the cPitchACF component; the final \(F_0\) output contains \(F_0\) values which are forced to 0 in unvoiced regions. See the documentation for more details.
 
36
In the cPitchACF component, which requires combined ACF and Cepstrum input from two instances of the cAcf component.
 
37
The method is implemented in openSMILE in two components: cSpecScale, which performs spectral peak enhancement, smoothing, octave-scale interpolation, and auditory weighting; and cPitchShs, which expects the spectrum produced by cSpecScale and performs the shifting, compression, and summation, as well as pitch candidate estimation by peak picking.
 
38
\(\gamma \) can be changed in openSMILE via the compressionFactor option of the cPitchShs component.
 
39
The greedy peak picking algorithm behaviour is achieved in openSMILE when the greedyPeakAlgo option is set to 1. The old (non-greedy) version of the algorithm searched through the peaks from lowest to highest frequency and considered the first peak found as the first candidate. Another candidate was only added if its magnitude was higher than that of the previous first candidate. This behaviour was sub-optimal for Viterbi-based smoothing, which requires multiple candidates in order to evaluate the best path among them.
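A sketch of one plausible reading of the old, non-greedy selection described above (the first peak from the low-frequency end becomes the first candidate; a further peak is only accepted if it is stronger than the previously accepted candidate); this paraphrases the footnote and is not the actual cPitchShs code.

```python
def nongreedy_candidates(peaks):
    """peaks: list of (frequency, magnitude) pairs sorted by ascending frequency."""
    candidates = []
    for freq, mag in peaks:
        # first peak found becomes the first candidate; later peaks are only
        # accepted as additional candidates if their magnitude is higher
        if not candidates or mag > candidates[-1][1]:
            candidates.append((freq, mag))
    return candidates
```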
 
40
In openSMILE this behaviour is not implemented in the cPitchShs component, but rather via the configuration, e.g., in the smileF0_base.conf and IS13_ComParE.conf configurations. There, the cValbasedSelector component is used to force F0 values to 0 (indicating unvoiced parts) if the energy falls below the threshold.
 
41
Available in openSMILE via the cPitchSmoother component.
 
42
In openSMILE the Viterbi-based pitch smoothing is implemented in the cPitchSmootherViterbi component.
 
43
In openSMILE version 2.0 and above, these parameters are implemented by the cHarmonics component.
 
44
This definition of Jitter is implemented in openSMILE in the cPitchJitter component. It can be enabled via the jitterLocal option.
 
45
This definition of delta Jitter is implemented in openSMILE in the cPitchJitter component. It can be enabled via the jitterDDP option.
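The two footnotes above refer to the Jitter equations in the chapter body. As a rough, hedged sketch, the widely used local and DDP formulations over consecutive pitch period lengths \(T_i\) are shown below; the exact definition and normalisation used in the thesis may differ.

```python
import numpy as np

def jitter_local(periods):
    """Mean absolute difference of consecutive period lengths, normalised by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def jitter_ddp(periods):
    """Mean absolute difference of consecutive period differences, normalised by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods, n=2))) / np.mean(periods)
```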
 
46
searchRangeRel option of the cPitchJitter component in openSMILE.
 
47
minCC option in openSMILE.
 
48
sourceQualityMean and sourceQualityRange options in cPitchJitter of openSMILE.
 
49
In openSMILE CHROMA features are supported by the cChroma component, which requires a semi-tone band spectrum as input; this spectrum can be generated by the cTonespec component (preferred) or by the (more general) cMelspec component.
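As a generic illustration of the chroma idea (not the cChroma implementation): given a semi-tone band spectrum, the 12 chroma bins are typically obtained by folding all octaves of the same pitch class together.

```python
import numpy as np

def chroma_from_semitone_bands(semitone_spec, first_band_pitch_class=0):
    """Fold a semi-tone band spectrum (one value per semitone band) into 12 pitch classes."""
    chroma = np.zeros(12)
    for b, value in enumerate(semitone_spec):
        chroma[(first_band_pitch_class + b) % 12] += value
    return chroma
```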
 
50
In openSMILE CENS features can be computed from CHROMA (PCP) features with the cCens component.
 
51
In openSMILE the simple difference function can be applied with the cDeltaRegression component with the delta window size set to 0 (option deltaWin \(=\) 0).
 
52
In openSMILE these delta regression coefficients can be computed with the cDeltaRegression component.
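A sketch of the commonly used delta-regression formula with half-window size W (cf. the deltaWin option in footnote 53); W = 0 reduces to the simple difference of footnote 51. The border handling chosen here (repeating edge values) is an arbitrary assumption, not necessarily what cDeltaRegression does.

```python
import numpy as np

def delta(contour, W=2):
    """d_t = sum_{n=1..W} n * (x_{t+n} - x_{t-n}) / (2 * sum_{n=1..W} n^2)."""
    x = np.asarray(contour, dtype=float)
    if W == 0:                           # simple difference x_t - x_{t-1}
        return np.diff(x, prepend=x[0])
    pad = np.pad(x, W, mode='edge')      # repeat border values
    denom = 2.0 * sum(n * n for n in range(1, W + 1))
    return np.array([
        sum(n * (pad[t + W + n] - pad[t + W - n]) for n in range(1, W + 1)) / denom
        for t in range(len(x))
    ])
```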
 
53
Option deltaWin in openSMILE component cDeltaRegression.
 
54
In openSMILE the smoothing via a moving average window is implemented in the cContourSmoother component. Feature names often carry the suffix _sma, which stands for ‘smoothed (with) moving average (filtering)’.
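A minimal sketch of such smoothing; the window length and border handling are arbitrary choices here, not the cContourSmoother defaults.

```python
import numpy as np

def sma(contour, win=3):
    """Smooth a feature contour with a symmetric moving average of length `win`."""
    kernel = np.ones(win) / win
    return np.convolve(contour, kernel, mode='same')
```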
 
55
In openSMILE univariate functionals are accessible via the cFunctionals component.
 
56
Implementations of mean value related functionals are contained in the cFunctionalMeans component in openSMILE, which can be activated by setting functionalsEnabled = Means in the configuration of cFunctionals.
 
57
And is the implementation used in openSMILE.
 
58
And also implemented in the cFunctionalMeans component.
 
59
In openSMILE the norm option of cFunctionalMeans can be set to segment to normalise counts and times etc. by N.
 
60
Implemented in openSMILE in the cFunctionalMoments component.
 
61
In openSMILE extreme values can be extracted with the cFunctionalExtremes component.
 
62
Percentiles are implemented in openSMILE in the cFunctionalPercentiles component.
 
63
In openSMILE the temporal centroid is implemented by the cFunctionalRegression component, as the sums are shared with the regression equations; computing both descriptors in the same component thus increases efficiency.
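As a generic illustration of this sharing (a least-squares sketch over the frame index; the normalisation and time-scaling options of cFunctionalRegression are not modelled here):

```python
import numpy as np

def centroid_and_linear_regression(x):
    """Temporal centroid plus slope and offset of a least-squares line, from shared sums."""
    x = np.asarray(x, dtype=float)
    N = len(x)                       # assumes N > 1 and a non-zero sum of x
    t = np.arange(N)
    S_t, S_tt = t.sum(), (t * t).sum()
    S_x, S_tx = x.sum(), (t * x).sum()
    centroid = S_tx / S_x                                   # temporal centroid (in frames)
    slope = (N * S_tx - S_t * S_x) / (N * S_tt - S_t ** 2)  # linear regression slope
    offset = (S_x - slope * S_t) / N                        # linear regression offset
    return centroid, slope, offset
```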
 
64
In openSMILE the cFunctionalRegression component computes linear and quadratic regression coefficients.
 
65
As used in this thesis, in order to avoid a name conflict with the quadratic regression coefficients a and b and time t.
 
66
In openSMILE, the time scaling feature is enabled by the normRegCoeff option in the cFunctionalRegression component. Setting it to 1 enables the relative time scale \(g=1/N\) and setting it to 2 enables the absolute time scale in seconds.
 
67
Option normInputs in the openSMILE component cFunctionalRegression—also affects the linear and quadratic error.
 
68
Option normInputs in the openSMILE component cFunctionalRegression—note that this option also affects the regression coefficients as it effectively normalises the input range.
 
69
In openSMILE these functionals are implemented in the component cFunctionalTimes.
 
70
Configurable with the norm option in openSMILE.
 
71
In openSMILE these functionals can be applied with the cFunctionalPeaks2 component; the cFunctionalPeaks component contains an older, obsolete peak picking algorithm.
 
72
In openSMILE, norm=second has to be set in cFunctionalPeaks2 for this behaviour (this is the default).
 
73
norm=frame in openSMILE.
 
74
norm=segment in openSMILE.
 
75
In openSMILE the norm option controls this behaviour (frames, seconds, and segment, respectively).
 
76
See the absThresh and relThresh options in the openSMILE component cFunctionalPeaks2.
 
77
In openSMILE segment-based temporal functionals can be computed with the component cFunctionalSegments.
 
78
Use the ravgLng option of the cFunctionalSegments component in openSMILE.
 
79
This length can be changed via the pauseMinLng option of the cFunctionalSegments component.
 
80
Computed in openSMILE by the cFunctionalOnset component.
 
81
Provided by the cFunctionalCrossings component in openSMILE.
 
82
Sample-based functionals are provided by the cFunctionalSamples component in openSMILE.
 
83
In openSMILE the cFunctionalDCT component computes DCT coefficient functionals.
 
84
In openSMILE the cFunctionalLpc component computes LP-analysis functionals.
 
85
In openSMILE the cFunctionalModulation component computes modulation spectrum functionals.
 
86
In openSMILE, the statistics can be applied to the modulation spectrum with the cSpectral component. Other components that expect magnitude spectra (e.g., ACF in cAcf) can also read from the output of cFunctionalModulation.
 
87
These features are not part of openSMILE (yet). It is planned to include them in future releases. C code is available from the author of this thesis upon request.
 
88
E.g., as is also implemented in the CURRENNT toolkit (http://sourceforge.net/projects/currennt) and the RNNLIB (http://sourceforge.net/projects/rnnl/).
 
Literature
go back to reference R.G. Bachu, S. Kopparthi, B. Adapa, B.D. Barkana, Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy, in Advanced Techniques in Computing Sciences and Software Engineering, ed. by K. Elleithy (Springer, Netherlands, 2010), pp. 279–282. doi:10.1007/978-90-481-3660-5_47. ISBN 978-90-481-3659-9 R.G. Bachu, S. Kopparthi, B. Adapa, B.D. Barkana, Voiced/unvoiced decision for speech signals based on zero-crossing rate and energy, in Advanced Techniques in Computing Sciences and Software Engineering, ed. by K. Elleithy (Springer, Netherlands, 2010), pp. 279–282. doi:10.​1007/​978-90-481-3660-5_​47. ISBN 978-90-481-3659-9
go back to reference A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, The impact of F0 extraction errors on the classification of prominence and emotion, in Proceedings of 16-th ICPhS (Saarbrücken, Germany, 2007), pp. 2201–2204 A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, V. Aharonson, The impact of F0 extraction errors on the classification of prominence and emotion, in Proceedings of 16-th ICPhS (Saarbrücken, Germany, 2007), pp. 2201–2204
go back to reference L.L. Beranek, Acoustic Measurements (Wiley, New York, 1949) L.L. Beranek, Acoustic Measurements (Wiley, New York, 1949)
go back to reference C.M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, New York, 1995)MATH C.M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, New York, 1995)MATH
go back to reference R.B. Blackman, J. Tukey, Particular pairs of windows, The Measurement of Power Spectra, from the Point of View of Communications Engineering (Dover, New York, 1959) R.B. Blackman, J. Tukey, Particular pairs of windows, The Measurement of Power Spectra, from the Point of View of Communications Engineering (Dover, New York, 1959)
go back to reference S. Böck, M. Schedl, Polyphonic piano note transcription with recurrent neural networks, in Proceedings of ICASSP 2012 (Kyoto, 2012), pp. 121–124 S. Böck, M. Schedl, Polyphonic piano note transcription with recurrent neural networks, in Proceedings of ICASSP 2012 (Kyoto, 2012), pp. 121–124
go back to reference P. Boersma, Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proc. 17, 97–110 (1993) P. Boersma, Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proc. 17, 97–110 (1993)
go back to reference P. Boersma, Praat, a system for doing phonetics by computer. Glot Int. 5(9/10), 341–345 (2001) P. Boersma, Praat, a system for doing phonetics by computer. Glot Int. 5(9/10), 341–345 (2001)
go back to reference B.P. Bogert, M.J.R. Healy, J.W. Tukey, The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking, in Proceedings of the Symposium on Time Series Analysis, chapter 15, ed. by M. Rosenblatt (Wiley, New York, 1963), pp. 209–243 B.P. Bogert, M.J.R. Healy, J.W. Tukey, The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking, in Proceedings of the Symposium on Time Series Analysis, chapter 15, ed. by M. Rosenblatt (Wiley, New York, 1963), pp. 209–243
go back to reference C.H. Chen, Signal Processing Handbook. Electrical Computer Engineering, vol. 51 (CRC Press, New York, 1988), 840 p. ISBN 978-0824779566 C.H. Chen, Signal Processing Handbook. Electrical Computer Engineering, vol. 51 (CRC Press, New York, 1988), 840 p. ISBN 978-0824779566
go back to reference A. Cheveigne, H. Kawahara, YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. (JASA) 111(4), 1917–1930 (2002)CrossRef A. Cheveigne, H. Kawahara, YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. (JASA) 111(4), 1917–1930 (2002)CrossRef
go back to reference J. Cooley, P. Lewis, P. Welch, The finite fourier transform. IEEE Trans. Audio Electroacoust. 17(2), 77–85 (1969)MathSciNetCrossRef J. Cooley, P. Lewis, P. Welch, The finite fourier transform. IEEE Trans. Audio Electroacoust. 17(2), 77–85 (1969)MathSciNetCrossRef
go back to reference C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)MATH C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)MATH
go back to reference R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, M. Schröder, Feeltrace: an instrument for recording perceived emotion in real time, in Proceedings of the ISCA Workshop on Speech and Emotion (Newcastle, Northern Ireland, 2000), pp. 19–24 R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, M. Schröder, Feeltrace: an instrument for recording perceived emotion in real time, in Proceedings of the ISCA Workshop on Speech and Emotion (Newcastle, Northern Ireland, 2000), pp. 19–24
go back to reference G. Dahl, T. Sainath, G. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in Proceedings of ICASSP 2013 (IEEE, Vancouver, 2013), pp. 8609–8613 G. Dahl, T. Sainath, G. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in Proceedings of ICASSP 2013 (IEEE, Vancouver, 2013), pp. 8609–8613
go back to reference G. Dalquist, A. Björk, N. Anderson, Numerical Methods (Prentice Hall, Englewood Cliffs, 1974) G. Dalquist, A. Björk, N. Anderson, Numerical Methods (Prentice Hall, Englewood Cliffs, 1974)
go back to reference S. Damelin, W. Miller, The Mathematics of Signal Processing (Cambridge University Press, Cambridge, 2011). ISBN 978-1107601048CrossRefMATH S. Damelin, W. Miller, The Mathematics of Signal Processing (Cambridge University Press, Cambridge, 2011). ISBN 978-1107601048CrossRefMATH
go back to reference G. de Krom, A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J. Speech Hear. Res. 36, 254–266 (1993)CrossRef G. de Krom, A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J. Speech Hear. Res. 36, 254–266 (1993)CrossRef
go back to reference J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete-Time Processing of Speech Signals, University of Michigan, Macmillan Publishing Company (1993) J.R. Deller, J.G. Proakis, J.H.L. Hansen, Discrete-Time Processing of Speech Signals, University of Michigan, Macmillan Publishing Company (1993)
go back to reference P. Deuflhard, Newton Methods For Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer Series in Computational Mathematics, vol. 35 (Springer, Berlin, 2011), 440 p P. Deuflhard, Newton Methods For Nonlinear Problems: Affine Invariance and Adaptive Algorithms. Springer Series in Computational Mathematics, vol. 35 (Springer, Berlin, 2011), 440 p
go back to reference E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, K. Karpouzis, The HUMAINE Database. Lecture Notes in Computer Science, vol. 4738 (Springer, Berlin, 2007), pp. 488–500 E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. McRorie, J.C. Martin, L. Devillers, S. Abrilian, A. Batliner, N. Amir, K. Karpouzis, The HUMAINE Database. Lecture Notes in Computer Science, vol. 4738 (Springer, Berlin, 2007), pp. 488–500
go back to reference J. Durbin, The fitting of time series models. Revue de l’Institut International de Statistique (Review of the International Statistical Institute) 28(3), 233–243 (1960)CrossRefMATH J. Durbin, The fitting of time series models. Revue de l’Institut International de Statistique (Review of the International Statistical Institute) 28(3), 233–243 (1960)CrossRefMATH
go back to reference C. Duxbury, M. Sandler, M. Davies, A hybrid approach to musical note onset detection, in Proceedings of the Digital Audio Effect Conference (DAFX’02) (Hamburg, Germany, 2002), pp. 33–38 C. Duxbury, M. Sandler, M. Davies, A hybrid approach to musical note onset detection, in Proceedings of the Digital Audio Effect Conference (DAFX’02) (Hamburg, Germany, 2002), pp. 33–38
go back to reference L.D. Enochson, R.K. Otnes, Programming and Analysis for Digital Time Series Data, 1st edn. U.S. Department of Defense, Shock and Vibration Information Center (1968) L.D. Enochson, R.K. Otnes, Programming and Analysis for Digital Time Series Data, 1st edn. U.S. Department of Defense, Shock and Vibration Information Center (1968)
go back to reference F. Eyben, M. Wöllmer, B. Schuller, openEAR—introducing the Munich open-source emotion and affect recognition toolkit, in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction (ACII 2009), vol. I (IEEE, Amsterdam, 2009a), pp. 576–581 F. Eyben, M. Wöllmer, B. Schuller, openEAR—introducing the Munich open-source emotion and affect recognition toolkit, in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction (ACII 2009), vol. I (IEEE, Amsterdam, 2009a), pp. 576–581
go back to reference F. Eyben, M. Wöllmer, B. Schuller, A. Graves, From speech to letters—using a novel neural network architecture for grapheme based ASR, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009 (IEEE, Merano, 2009b), pp. 376–380 F. Eyben, M. Wöllmer, B. Schuller, A. Graves, From speech to letters—using a novel neural network architecture for grapheme based ASR, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009 (IEEE, Merano, 2009b), pp. 376–380
go back to reference F. Eyben, M. Wöllmer, B. Schuller, openSMILE—The Munich versatile and fast open-source audio feature extractor, in Proceedings of ACM Multimedia 2010 (ACM, Florence, 2010a), pp. 1459–1462 F. Eyben, M. Wöllmer, B. Schuller, openSMILE—The Munich versatile and fast open-source audio feature extractor, in Proceedings of ACM Multimedia 2010 (ACM, Florence, 2010a), pp. 1459–1462
go back to reference F. Eyben, S. Böck, B. Schuller, A. Graves, Universal onset detection with bidirectional long-short term memory neural networks, in Proceedings of ISMIR 2010 (ISMIR, Utrecht, The Netherlands, 2010b), pp. 589–594 F. Eyben, S. Böck, B. Schuller, A. Graves, Universal onset detection with bidirectional long-short term memory neural networks, in Proceedings of ISMIR 2010 (ISMIR, Utrecht, The Netherlands, 2010b), pp. 589–594
go back to reference F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, R. Cowie, On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J. Multimodal User Interfaces (JMUI) 3(1–2), 7–19 (2010c). doi:10.1007/s12193-009-0032-6 F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, R. Cowie, On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J. Multimodal User Interfaces (JMUI) 3(1–2), 7–19 (2010c). doi:10.​1007/​s12193-009-0032-6
go back to reference F. Eyben, M. Wöllmer, B. Schuller, A multi-task approach to continuous five-dimensional affect sensing in natural speech, ACM Trans. Interact. Intell. Syst. 2(1), Article No. 6, 29 p. Special Issue on Affective Interaction in Natural Environments (2012) F. Eyben, M. Wöllmer, B. Schuller, A multi-task approach to continuous five-dimensional affect sensing in natural speech, ACM Trans. Interact. Intell. Syst. 2(1), Article No. 6, 29 p. Special Issue on Affective Interaction in Natural Environments (2012)
go back to reference G. Fant, Speech Sounds and Features (MIT press, Cambridge, 1973), p. 227 G. Fant, Speech Sounds and Features (MIT press, Cambridge, 1973), p. 227
go back to reference H.G. Feichtinger, T. Strohmer, Gabor Analysis and Algorithms (Birkhäuser, Boston, 1998). ISBN 0-8176-3959-4CrossRefMATH H.G. Feichtinger, T. Strohmer, Gabor Analysis and Algorithms (Birkhäuser, Boston, 1998). ISBN 0-8176-3959-4CrossRefMATH
go back to reference J.-B.-J. Fourier, Théorie analytique de la chaleur, University of Lausanne, Switzerland (1822) J.-B.-J. Fourier, Théorie analytique de la chaleur, University of Lausanne, Switzerland (1822)
go back to reference T. Fujishima, Realtime chord recognition of musical sound: a system using common lisp music, in Proceedings of the International Computer Music Conference (ICMC) 1999 (Bejing, China, 1999), pp. 464–467 T. Fujishima, Realtime chord recognition of musical sound: a system using common lisp music, in Proceedings of the International Computer Music Conference (ICMC) 1999 (Bejing, China, 1999), pp. 464–467
go back to reference S. Furui, Digital Speech Processing: Synthesis, and Recognition. Signal Processing and Communications, 2nd edn. (Marcel Denker Inc., New York, 1996) S. Furui, Digital Speech Processing: Synthesis, and Recognition. Signal Processing and Communications, 2nd edn. (Marcel Denker Inc., New York, 1996)
go back to reference C. Glaser, M. Heckmann, F. Joublin, C. Goerick, Combining auditory preprocessing and bayesian estimation for robust formant tracking. IEEE Trans. Audio Speech Lang. Process. 18(2), 224–236 (2010)CrossRef C. Glaser, M. Heckmann, F. Joublin, C. Goerick, Combining auditory preprocessing and bayesian estimation for robust formant tracking. IEEE Trans. Audio Speech Lang. Process. 18(2), 224–236 (2010)CrossRef
go back to reference F. Gouyon, F. Pachet, O. Delerue. Classifying percussive sounds: a matter of zero-crossing rate? in Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00) (Verona, Italy, 2000) F. Gouyon, F. Pachet, O. Delerue. Classifying percussive sounds: a matter of zero-crossing rate? in Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00) (Verona, Italy, 2000)
go back to reference A. Graves, Supervised sequence labelling with recurrent neural networks. Doctoral thesis, Technische Universität München, Munich, Germany (2008) A. Graves, Supervised sequence labelling with recurrent neural networks. Doctoral thesis, Technische Universität München, Munich, Germany (2008)
go back to reference A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)CrossRef A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)CrossRef
go back to reference W.D. Gregg, Analog & Digital Communication (Wiley, New York, 1977). ISBN 978-0-471-32661-8 W.D. Gregg, Analog & Digital Communication (Wiley, New York, 1977). ISBN 978-0-471-32661-8
go back to reference M. Grimm, K. Kroschel, S. Narayanan, Support vector regression for automatic recognition of spontaneous emotions in speech, in Proceedings of ICASSP 2007, vol. 4 (IEEE, Honolulu, 2007), pp. 1085–1088 M. Grimm, K. Kroschel, S. Narayanan, Support vector regression for automatic recognition of spontaneous emotions in speech, in Proceedings of ICASSP 2007, vol. 4 (IEEE, Honolulu, 2007), pp. 1085–1088
go back to reference B. Hammarberg, B. Fritzell, J. Gauffin, J. Sundberg, L. Wedin, Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngol. 90, 441–451 (1980)CrossRef B. Hammarberg, B. Fritzell, J. Gauffin, J. Sundberg, L. Wedin, Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngol. 90, 441–451 (1980)CrossRef
go back to reference H. Hanson, Glottal characteristics of female speakers: acoustic correlates. J. Acoust. Soc. Am. (JASA) 101, 466–481 (1997)CrossRef H. Hanson, Glottal characteristics of female speakers: acoustic correlates. J. Acoust. Soc. Am. (JASA) 101, 466–481 (1997)CrossRef
go back to reference H. Hanson, E.S. Chuang, Glottal characteristics of male speakers: acoustic correlates and comparison with female data. J. Acoust. Soc. Am. (JASA) 106, 1064–1077 (1999)CrossRef H. Hanson, E.S. Chuang, Glottal characteristics of male speakers: acoustic correlates and comparison with female data. J. Acoust. Soc. Am. (JASA) 106, 1064–1077 (1999)CrossRef
go back to reference F.J. Harris, On the use of windows for harmonic analysis with the discrete fourier transform. Proc. IEEE 66, 51–83 (1978)CrossRef F.J. Harris, On the use of windows for harmonic analysis with the discrete fourier transform. Proc. IEEE 66, 51–83 (1978)CrossRef
go back to reference H. Hermansky, Perceptual linear predictive (PLP) analysis for speech. J. Acoust. Soc. Am. (JASA) 87, 1738–1752 (1990)CrossRef H. Hermansky, Perceptual linear predictive (PLP) analysis for speech. J. Acoust. Soc. Am. (JASA) 87, 1738–1752 (1990)CrossRef
go back to reference H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in Proceedings of ICASSP 1992, vol. 1 (IEEE, San Francisco, 1992), pp. 121–124 H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in Proceedings of ICASSP 1992, vol. 1 (IEEE, San Francisco, 1992), pp. 121–124
go back to reference D.J. Hermes, Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am. (JASA) 83(1), 257–264 (1988)CrossRef D.J. Hermes, Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am. (JASA) 83(1), 257–264 (1988)CrossRef
go back to reference W. Hess, Pitch Determination of Speech Signals: Algorithms and Devices (Springer, Berlin, 1983)CrossRef W. Hess, Pitch Determination of Speech Signals: Algorithms and Devices (Springer, Berlin, 1983)CrossRef
go back to reference S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRef S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRef
go back to reference S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, in A Field Guide to Dynamical Recurrent Neural Networks, ed. by S.C. Kremer, J.F. Kolen (IEEE Press, New York, 2001) S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, in A Field Guide to Dynamical Recurrent Neural Networks, ed. by S.C. Kremer, J.F. Kolen (IEEE Press, New York, 2001)
go back to reference ISO16:1975. ISO Standard 16:1975 Acoustics: Standard tuning frequency (Standard musical pitch). International Organization for Standardization (ISO) (1975) ISO16:1975. ISO Standard 16:1975 Acoustics: Standard tuning frequency (Standard musical pitch). International Organization for Standardization (ISO) (1975)
go back to reference T. Joachims, Text categorization with support vector machines: learning with many relevant features, in Proceedings of the 10th European Conference on Machine Learning (ECML-98), ed. by C. Nédellec, C. Rouveirol (Springer, Chemnitz, 1998), pp. 137–142 T. Joachims, Text categorization with support vector machines: learning with many relevant features, in Proceedings of the 10th European Conference on Machine Learning (ECML-98), ed. by C. Nédellec, C. Rouveirol (Springer, Chemnitz, 1998), pp. 137–142
go back to reference J.D. Johnston, Transform coding of audio signals using perceptual noise criteria. IEEE J. Sel. Areas Commun. 6(2), 314–332 (1988)CrossRef J.D. Johnston, Transform coding of audio signals using perceptual noise criteria. IEEE J. Sel. Areas Commun. 6(2), 314–332 (1988)CrossRef
go back to reference P. Kabal, R.P. Ramachandran, The computation of line spectral frequencies using Chebyshev polynomials. IEEE Trans. Acoust. Speech Signal Process. 34(6), 1419–1426 (1986)CrossRef P. Kabal, R.P. Ramachandran, The computation of line spectral frequencies using Chebyshev polynomials. IEEE Trans. Acoust. Speech Signal Process. 34(6), 1419–1426 (1986)CrossRef
go back to reference R. Kendall, E. Carterette, Difference thresholds for timbre related to spectral centroid, in Proceedings of the 4-th International Conference on Music Perception and Cognition (ICMPC) (Montreal, Canada, 1996), pp. 91–95 R. Kendall, E. Carterette, Difference thresholds for timbre related to spectral centroid, in Proceedings of the 4-th International Conference on Music Perception and Cognition (ICMPC) (Montreal, Canada, 1996), pp. 91–95
go back to reference J.F. Kenney, E.S. Keeping, Root mean square, Mathematics of Statistics, vol. 1, 3rd edn. (Van Nostrand, Princeton, 1962), pp. 59–60 J.F. Kenney, E.S. Keeping, Root mean square, Mathematics of Statistics, vol. 1, 3rd edn. (Van Nostrand, Princeton, 1962), pp. 59–60
go back to reference A. Kießling, Extraktion und Klassifikation prosodischer Merkmale in der automatischen Sprachverarbeitung (Shaker, Aachen, 1997). ISBN 978-3-8265-2245-1 A. Kießling, Extraktion und Klassifikation prosodischer Merkmale in der automatischen Sprachverarbeitung (Shaker, Aachen, 1997). ISBN 978-3-8265-2245-1
go back to reference A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, vol. 25, ed. by F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Curran Associates, Inc., 2012), pp. 1097–1105 A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, vol. 25, ed. by F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Curran Associates, Inc., 2012), pp. 1097–1105
go back to reference K. Kroschel, G. Rigoll, B. Schuller, Statistische Informationstechnik, 5th edn. (Springer, Berlin, 2011)CrossRefMATH K. Kroschel, G. Rigoll, B. Schuller, Statistische Informationstechnik, 5th edn. (Springer, Berlin, 2011)CrossRefMATH
go back to reference K. Lee, M. Slaney, Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Trans. Audio Speech Lang. Process. 16(2), 291–301 (2008). doi:10.1109/TASL.2007.914399. ISSN 1558-7916CrossRef K. Lee, M. Slaney, Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Trans. Audio Speech Lang. Process. 16(2), 291–301 (2008). doi:10.​1109/​TASL.​2007.​914399. ISSN 1558-7916CrossRef
go back to reference P. Lejeune-Dirichlet, Sur la convergence des séries trigonométriques qui servent à représenter une fonction arbitraire entre des limites données. Journal für die reine und angewandte Mathematik 4, 157–169 (1829)MathSciNetCrossRef P. Lejeune-Dirichlet, Sur la convergence des séries trigonométriques qui servent à représenter une fonction arbitraire entre des limites données. Journal für die reine und angewandte Mathematik 4, 157–169 (1829)MathSciNetCrossRef
go back to reference N. Levinson, A heuristic exposition of wiener’s mathematical theory of prediction and filtering. J. Math. Phys. 25, 110–119 (1947a) N. Levinson, A heuristic exposition of wiener’s mathematical theory of prediction and filtering. J. Math. Phys. 25, 110–119 (1947a)
go back to reference N. Levinson, The Wiener RMS error criterion in filter design and prediction. J. Math. Phys. 25(4), 261–278 (1947b) N. Levinson, The Wiener RMS error criterion in filter design and prediction. J. Math. Phys. 25(4), 261–278 (1947b)
go back to reference P.I. Lizorkin, Fourier transform, in Encyclopaedia of Mathematics, ed. by M. Hazewinkel (Springer, Berlin, 2002). ISBN 1-4020-0609-8 P.I. Lizorkin, Fourier transform, in Encyclopaedia of Mathematics, ed. by M. Hazewinkel (Springer, Berlin, 2002). ISBN 1-4020-0609-8
go back to reference I. Luengo, Evaluation of pitch detection algorithms under real conditions, in Proceedings of ICASSP 2007, vol. 4 (IEEE, Honolulu, 2007), pp. 1057–1060 I. Luengo, Evaluation of pitch detection algorithms under real conditions, in Proceedings of ICASSP 2007, vol. 4 (IEEE, Honolulu, 2007), pp. 1057–1060
go back to reference J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(5), 561–580 (1975)CrossRef J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(5), 561–580 (1975)CrossRef
go back to reference J. Makhoul, L. Cosell, LPCW: an LPC vocoder with linear predictive spectral warping, in Proceedings of ICASSP 1976 (IEEE, Philadelphia, 1976), pp. 466–469 J. Makhoul, L. Cosell, LPCW: an LPC vocoder with linear predictive spectral warping, in Proceedings of ICASSP 1976 (IEEE, Philadelphia, 1976), pp. 466–469
go back to reference B.S. Manjunath, P. Salembier, T. Sikoraa (eds.), Introduction to MPEG-7: Multimedia Content Description Interface (Wiley, Berlin, 2002), 396 p. ISBN 978-0-471-48678-7 B.S. Manjunath, P. Salembier, T. Sikoraa (eds.), Introduction to MPEG-7: Multimedia Content Description Interface (Wiley, Berlin, 2002), 396 p. ISBN 978-0-471-48678-7
go back to reference P. Martin, Détection de \(f_0\) par intercorrelation avec une fonction peigne. J. Etude Parole 12, 221–232 (1981) P. Martin, Détection de \(f_0\) par intercorrelation avec une fonction peigne. J. Etude Parole 12, 221–232 (1981)
go back to reference P. Martin, Comparison of pitch detection by cepstrum and spectral comb analysis, in Proceedings of ICASSP 1982 (IEEE, Paris, 1982), pp. 180–183 P. Martin, Comparison of pitch detection by cepstrum and spectral comb analysis, in Proceedings of ICASSP 1982 (IEEE, Paris, 1982), pp. 180–183
go back to reference J. Martinez, H. Perez, E. Escamilla, M.M. Suzuki, Speaker recognition using mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques, in Proceedings of the 22nd International Conference on Electrical Communications and Computers (CONIELECOMP) (Cholula, Puebla, 2012), pp. 248–251. doi:10.1109/CONIELECOMP.2012.6189918 J. Martinez, H. Perez, E. Escamilla, M.M. Suzuki, Speaker recognition using mel frequency cepstral coefficients (MFCC) and vector quantization (VQ) techniques, in Proceedings of the 22nd International Conference on Electrical Communications and Computers (CONIELECOMP) (Cholula, Puebla, 2012), pp. 248–251. doi:10.​1109/​CONIELECOMP.​2012.​6189918
go back to reference P. Masri, Computer modelling of sound for transformation and synthesis of musical signal. Doctoral thesis, University of Bristol, Bristol (1996) P. Masri, Computer modelling of sound for transformation and synthesis of musical signal. Doctoral thesis, University of Bristol, Bristol (1996)
go back to reference S. McCandless, An algorithm for automatic formant extraction using linear prediction spectra. IEEE Trans. Acoust. Speech Signal Process. 22, 134–141 (1974)CrossRef S. McCandless, An algorithm for automatic formant extraction using linear prediction spectra. IEEE Trans. Acoust. Speech Signal Process. 22, 134–141 (1974)CrossRef
go back to reference D.D. Mehta, D. Rudoy, P.K. Wolfe, Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking. J. Acoust. Soc. Am. (JASA) 132(3), 1732–1746 (2012)CrossRef D.D. Mehta, D. Rudoy, P.K. Wolfe, Kalman-based autoregressive moving average modeling and inference for formant and antiformant tracking. J. Acoust. Soc. Am. (JASA) 132(3), 1732–1746 (2012)CrossRef
go back to reference H. Misra, S. Ikbal, H. Bourlard, H. Hermansky, Spectral entropy based feature for robust ASR, in Proceedings of ICASSP 2004, vol. 1 (IEEE, Montreal, Canada, 2004), pp. I–193–6. doi:10.1109/ICASSP.2004.1325955 H. Misra, S. Ikbal, H. Bourlard, H. Hermansky, Spectral entropy based feature for robust ASR, in Proceedings of ICASSP 2004, vol. 1 (IEEE, Montreal, Canada, 2004), pp. I–193–6. doi:10.​1109/​ICASSP.​2004.​1325955
go back to reference O. Mubarak, E. Ambikairajah, J. Epps, T. Gunawan, Modulation features for speech and music classification, in Proceedings of the 10th IEEE Singapore International Conference on Communication systems (ICCS) 2006 (IEEE, 2006), pp. 1–5. doi:10.1109/ICCS.2006.301515 O. Mubarak, E. Ambikairajah, J. Epps, T. Gunawan, Modulation features for speech and music classification, in Proceedings of the 10th IEEE Singapore International Conference on Communication systems (ICCS) 2006 (IEEE, 2006), pp. 1–5. doi:10.​1109/​ICCS.​2006.​301515
go back to reference M. Müller, Information Retrieval for Music and Motion (Springer, Berlin, 2007)CrossRef M. Müller, Information Retrieval for Music and Motion (Springer, Berlin, 2007)CrossRef
go back to reference M. Müller, F. Kurth, M. Clausen, Audio matching via chroma-based statistical features, in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR) (London, 2005a), pp. 288–295 M. Müller, F. Kurth, M. Clausen, Audio matching via chroma-based statistical features, in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR) (London, 2005a), pp. 288–295
go back to reference M. Müller, F. Kurth, M. Clausen, Chroma-based statistical audio features for audio matching, in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE, 2005b), pp. 275–278 M. Müller, F. Kurth, M. Clausen, Chroma-based statistical audio features for audio matching, in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE, 2005b), pp. 275–278
go back to reference N.J. Nalini, S. Palanivel, Emotion recognition in music signal using AANN and SVM. Int. J. Comput. Appl. 77(2), 7–14 (2013) N.J. Nalini, S. Palanivel, Emotion recognition in music signal using AANN and SVM. Int. J. Comput. Appl. 77(2), 7–14 (2013)
go back to reference A.M. Noll, Cepstrum pitch determination. J. Acoust. Soc. Am. (JASA) 41(2), 293–309 (1967)CrossRef A.M. Noll, Cepstrum pitch determination. J. Acoust. Soc. Am. (JASA) 41(2), 293–309 (1967)CrossRef
go back to reference A.M. Noll, Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate, in Symposium on Computer Processing in Communication, vol. 19 (University of Brooklyn, New York, 1970), pp. 779–797, edited by the Microwave Institute A.M. Noll, Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate, in Symposium on Computer Processing in Communication, vol. 19 (University of Brooklyn, New York, 1970), pp. 779–797, edited by the Microwave Institute
go back to reference A.H. Nuttal, Some windows with very good sidelobe behavior. IEEE Trans. Acoust. Speech Signal Process. ASSP 29, 84–91 (1981)CrossRef A.H. Nuttal, Some windows with very good sidelobe behavior. IEEE Trans. Acoust. Speech Signal Process. ASSP 29, 84–91 (1981)CrossRef
go back to reference A.V. Oppenheim, R.W. Schafer, Digital Signal Processing (Prentice-Hall, Englewood Cliffs, 1975)MATH A.V. Oppenheim, R.W. Schafer, Digital Signal Processing (Prentice-Hall, Englewood Cliffs, 1975)MATH
go back to reference A.V. Oppenheim, A.S. Willsky, S. Hamid, Signals and Systems, 2nd edn. (Prentice Hall, Upper Saddle River, 1996) A.V. Oppenheim, A.S. Willsky, S. Hamid, Signals and Systems, 2nd edn. (Prentice Hall, Upper Saddle River, 1996)
go back to reference A.V. Oppenheim, R.W. Schafer, J.R. Buck, Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, 1999) A.V. Oppenheim, R.W. Schafer, J.R. Buck, Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, 1999)
go back to reference T.W. Parsons, Voice and Speech Processing. Electrical and Computer Engineering (University of Michigan, McGraw-Hill, 1987) T.W. Parsons, Voice and Speech Processing. Electrical and Computer Engineering (University of Michigan, McGraw-Hill, 1987)
go back to reference S. Patel, K.R. Scherer, J. Sundberg, E. Björkner, Acoustic markers of emotions based on voice physiology, in Proceedings of Speech Prosody 2010 (ISCA, Chicago, 2010), pp. 100865:1–4 S. Patel, K.R. Scherer, J. Sundberg, E. Björkner, Acoustic markers of emotions based on voice physiology, in Proceedings of Speech Prosody 2010 (ISCA, Chicago, 2010), pp. 100865:1–4
go back to reference V. Pham, C. Kermorvant, J. Louradour, Dropout improves recurrent neural networks for handwriting recognition, in CoRR (2013) (online), arXiv:1312.4569 V. Pham, C. Kermorvant, J. Louradour, Dropout improves recurrent neural networks for handwriting recognition, in CoRR (2013) (online), arXiv:​1312.​4569
go back to reference J. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical report MSR-98-14, Microsoft Research (1998) J. Platt, Sequential minimal optimization: a fast algorithm for training support vector machines, Technical report MSR-98-14, Microsoft Research (1998)
go back to reference L.R. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)CrossRef L.R. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)CrossRef
go back to reference L.R. Rabiner, B.H. Juang, An introduction to hidden markov models. IEEE ASSP Mag. 3(1), 4–16 (1986)CrossRef L.R. Rabiner, B.H. Juang, An introduction to hidden markov models. IEEE ASSP Mag. 3(1), 4–16 (1986)CrossRef
go back to reference L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, 1st edn. (Prentice Hall, Englewood Cliffs, 1993)MATH L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, 1st edn. (Prentice Hall, Englewood Cliffs, 1993)MATH
go back to reference L. Rade, B. Westergren, Springers Mathematische Formeln (German translation by P. Vachenauer), 3rd edn. (Springer, Berlin, 2000). ISBN 3-540-67505-1CrossRef L. Rade, B. Westergren, Springers Mathematische Formeln (German translation by P. Vachenauer), 3rd edn. (Springer, Berlin, 2000). ISBN 3-540-67505-1CrossRef
go back to reference J.F. Reed, F. Lynn, B.D. Meade, Use of coefficient of variation in assessing variability of quantitative assays. Clin. Diagn. Lab. Immunol. 9(6), 1235–1239 (2002) J.F. Reed, F. Lynn, B.D. Meade, Use of coefficient of variation in assessing variability of quantitative assays. Clin. Diagn. Lab. Immunol. 9(6), 1235–1239 (2002)
go back to reference M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in Proceedings of the IEEE International Conference on Neural Networks, vol. 1 (IEEE, San Francisco, 1993), pp. 586–591. doi:10.1109/icnn.1993.298623 M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in Proceedings of the IEEE International Conference on Neural Networks, vol. 1 (IEEE, San Francisco, 1993), pp. 586–591. doi:10.​1109/​icnn.​1993.​298623
go back to reference F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions, in Proceedings of the 2nd International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE), held in conjunction with FG 2013 (IEEE, Shanghai, 2013), pp. 1–8 F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions, in Proceedings of the 2nd International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE), held in conjunction with FG 2013 (IEEE, Shanghai, 2013), pp. 1–8
go back to reference S. Rosen, P. Howell, The vocal tract as a linear system, Signals and Systems for Speech and Hearing, 1st edn. (Emerald Group, 1991), pp. 92–99. ISBN 978-0125972314 S. Rosen, P. Howell, The vocal tract as a linear system, Signals and Systems for Speech and Hearing, 1st edn. (Emerald Group, 1991), pp. 92–99. ISBN 978-0125972314
go back to reference G. Ruske, Automatische Spracherkennung. Methoden der Klassifikation und Merkmalsextraktion, 2nd edn. (Oldenbourg, Munich, 1993) G. Ruske, Automatische Spracherkennung. Methoden der Klassifikation und Merkmalsextraktion, 2nd edn. (Oldenbourg, Munich, 1993)
go back to reference B. Schölkopf, A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning) (MIT Press, Cambridge, 2002) B. Schölkopf, A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning) (MIT Press, Cambridge, 2002)
go back to reference M. Schröder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, M. Wöllmer, Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3(2), 165–183 (2012)CrossRef M. Schröder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, M. Wöllmer, Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3(2), 165–183 (2012)CrossRef
go back to reference M.R. Schroeder, Period histogram and product spectrum: new methods for fundamental-frequency measurement. J. Acoust. Soc. Am. (JASA) 43, 829–834 (1968)CrossRef M.R. Schroeder, Period histogram and product spectrum: new methods for fundamental-frequency measurement. J. Acoust. Soc. Am. (JASA) 43, 829–834 (1968)CrossRef
go back to reference M.R. Schroeder, Recognition of complex acoustic signals, in Life Sciences Research Reports, vol. 5, ed. by T.H. Bullock (Abakon Verlag, Berlin, 1977), 324 p M.R. Schroeder, Recognition of complex acoustic signals, in Life Sciences Research Reports, vol. 5, ed. by T.H. Bullock (Abakon Verlag, Berlin, 1977), 324 p
go back to reference B. Schuller, Automatische Emotionserkennung aus sprachlicher und manueller Interaktion. Doctoral thesis, Technische Universität München, Munich, Germany (2006) B. Schuller, Automatische Emotionserkennung aus sprachlicher und manueller Interaktion. Doctoral thesis, Technische Universität München, Munich, Germany (2006)
go back to reference B. Schuller, Intelligent Audio Analysis. Signals and Communication Technology (Springer, Berlin, 2013) B. Schuller, Intelligent Audio Analysis. Signals and Communication Technology (Springer, Berlin, 2013)
go back to reference B. Schuller, A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing (Wiley, Hoboken, 2013), 344 p. ISBN 978-1119971368 B. Schuller, A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing (Wiley, Hoboken, 2013), 344 p. ISBN 978-1119971368
go back to reference B. Schuller, G. Rigoll, M. Lang, Hidden Markov model-based speech emotion recognition, in Proceedings of ICASSP 2003, vol. 2 (IEEE, Hong Kong, 2003), pp. II 1–4 B. Schuller, G. Rigoll, M. Lang, Hidden Markov model-based speech emotion recognition, in Proceedings of ICASSP 2003, vol. 2 (IEEE, Hong Kong, 2003), pp. II 1–4
go back to reference B. Schuller, D. Arsić, F. Wallhoff, G. Rigoll, Emotion recognition in the noise applying large acoustic feature sets, in Proceedings of the 3rd International Conference on Speech Prosody (SP) 2006 (ISCA, Dresden, 2006), pp. 276–289 B. Schuller, D. Arsić, F. Wallhoff, G. Rigoll, Emotion recognition in the noise applying large acoustic feature sets, in Proceedings of the 3rd International Conference on Speech Prosody (SP) 2006 (ISCA, Dresden, 2006), pp. 276–289
go back to reference B. Schuller, F. Eyben, G. Rigoll, Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles, in Proceedings of ICASSP 2007, vol. I (IEEE, Honolulu, 2007), pp. 217–220 B. Schuller, F. Eyben, G. Rigoll, Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles, in Proceedings of ICASSP 2007, vol. I (IEEE, Honolulu, 2007), pp. 217–220
go back to reference B. Schuller, F. Eyben, G. Rigoll, Beat-synchronous data-driven automatic chord labeling, in Proceedings 34. Jahrestagung für Akustik (DAGA) 2008 (DEGA, Dresden, 2008), pp. 555–556 B. Schuller, F. Eyben, G. Rigoll, Beat-synchronous data-driven automatic chord labeling, in Proceedings 34. Jahrestagung für Akustik (DAGA) 2008 (DEGA, Dresden, 2008), pp. 555–556
go back to reference B. Schuller, S. Steidl, A. Batliner, F. Jurcicek, The INTERSPEECH 2009 emotion challenge, in Proceedings of INTERSPEECH 2009 (Brighton, 2009a), pp. 312–315 B. Schuller, S. Steidl, A. Batliner, F. Jurcicek, The INTERSPEECH 2009 emotion challenge, in Proceedings of INTERSPEECH 2009 (Brighton, 2009a), pp. 312–315
go back to reference B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, Acoustic emotion recognition: A benchmark comparison of performances, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009 (IEEE, Merano, 2009b), pp. 552–557 B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, Acoustic emotion recognition: A benchmark comparison of performances, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2009 (IEEE, Merano, 2009b), pp. 552–557
go back to reference B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The INTERSPEECH 2010 paralinguistic challenge, in Proceedings of INTERSPEECH 2010 (ISCA, Makuhari, 2010), pp. 2794–2797 B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, S. Narayanan, The INTERSPEECH 2010 paralinguistic challenge, in Proceedings of INTERSPEECH 2010 (ISCA, Makuhari, 2010), pp. 2794–2797
go back to reference B. Schuller, A. Batliner, S. Steidl, F. Schiel, J. Krajewski, The INTERSPEECH 2011 speaker state challenge, in Proceedings of INTERSPEECH 2011 (ISCA, Florence, 2011), pp. 3201–3204 B. Schuller, A. Batliner, S. Steidl, F. Schiel, J. Krajewski, The INTERSPEECH 2011 speaker state challenge, in Proceedings of INTERSPEECH 2011 (ISCA, Florence, 2011), pp. 3201–3204
go back to reference B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, B. Weiss, The INTERSPEECH 2012 speaker trait challenge, in Proceedings of INTERSPEECH 2012 (ISCA, Portland, 2012a) B. Schuller, S. Steidl, A. Batliner, E. Nöth, A. Vinciarelli, F. Burkhardt, R. van Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, B. Weiss, The INTERSPEECH 2012 speaker trait challenge, in Proceedings of INTERSPEECH 2012 (ISCA, Portland, 2012a)
go back to reference B. Schuller, M. Valstar, R. Cowie, M. Pantic, AVEC 2012: the continuous audio/visual emotion challenge—an introduction, in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI) 2012, ed. by L.-P. Morency, D. Bohus, H.K. Aghajan, J. Cassell, A. Nijholt, J. Epps (ACM, Santa Monica, 2012b), pp. 361–362. October B. Schuller, M. Valstar, R. Cowie, M. Pantic, AVEC 2012: the continuous audio/visual emotion challenge—an introduction, in Proceedings of the 14th ACM International Conference on Multimodal Interaction (ICMI) 2012, ed. by L.-P. Morency, D. Bohus, H.K. Aghajan, J. Cassell, A. Nijholt, J. Epps (ACM, Santa Monica, 2012b), pp. 361–362. October
go back to reference B. Schuller, F. Pokorny, S. Ladstätter, M. Fellner, F. Graf, L. Paletta. Acoustic geo-sensing: recognising cyclists’ route, route direction, and route progress from cell-phone audio, in Proceedings of ICASSP 2013 (IEEE, Vancouver, 2013a), pp. 453–457 B. Schuller, F. Pokorny, S. Ladstätter, M. Fellner, F. Graf, L. Paletta. Acoustic geo-sensing: recognising cyclists’ route, route direction, and route progress from cell-phone audio, in Proceedings of ICASSP 2013 (IEEE, Vancouver, 2013a), pp. 453–457
B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, et al., The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism, in Proceedings of INTERSPEECH 2013 (ISCA, Lyon, 2013b), pp. 148–152
M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
C.E. Shannon, A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948). (Reprint with corrections in: ACM SIGMOBILE Mobile Computing and Communications Review 5(1), 3–55 (2001))
M. Slaney, An efficient implementation of the Patterson-Holdsworth auditory filter bank. Technical Report 35, Apple Computer Inc. (1993)
M. Soleymani, M.N. Caro, E.M. Schmidt, Y.-H. Yang, The MediaEval 2013 brave new task: emotion in music, in Proceedings of the MediaEval 2013 Workshop (CEUR-WS.org, Barcelona, 2013)
F.K. Soong, B.-W. Juang, Line spectrum pair (LSP) and speech data compression, in Proceedings of ICASSP 1984 (IEEE, San Diego, 1984), pp. 1.10.1–1.10.4
A. Spanias, T. Painter, V. Atti, Audio Signal Processing and Coding (Wiley, Hoboken, 2007), 464 p. ISBN 978-0-471-79147-8
J. Stadermann, G. Rigoll, A hybrid SVM/HMM acoustic modeling approach to automatic speech recognition, in Proceedings of INTERSPEECH 2004 (ISCA, Jeju, 2004), pp. 661–664
J. Stadermann, G. Rigoll, Hybrid NN/HMM acoustic modeling techniques for distributed speech recognition. Speech Commun. 48(8), 1037–1046 (2006)
J.F. Steffensen, Interpolation, 2nd edn. (Dover Publications, New York, 2012), 256 p. ISBN 978-0486154831
P. Suman, S. Karan, V. Singh, R. Maringanti, Algorithm for gunshot detection using mel-frequency cepstrum coefficients (MFCC), in Proceedings of the Ninth International Conference on Wireless Communication and Sensor Networks, ed. by R. Maringanti, M. Tiwari, A. Arora. Lecture Notes in Electrical Engineering, vol. 299 (Springer, India, 2014), pp. 155–166. doi:10.1007/978-81-322-1823-4_15. ISBN 978-81-322-1822-7
J. Sundberg, The Science of the Singing Voice (Northern Illinois University Press, Dekalb, 1987), 226 p. ISBN 978-0-87580-542-9
D. Talkin, A robust algorithm for pitch tracking (RAPT), in Speech Coding and Synthesis, ed. by W.B. Kleijn, K.K. Paliwal (Elsevier, New York, 1995), pp. 495–518. ISBN 0444821694
L. Tamarit, M. Goudbeek, K.R. Scherer, Spectral slope measurements in emotionally expressive speech, in Proceedings of SPKD-2008 (ISCA, 2008), paper 007
H.M. Teager, S.M. Teager, Evidence for nonlinear sound production mechanisms in the vocal tract, in Proceedings of Speech Production and Speech Modelling, Bonas, France, ed. by W.J. Hardcastle, A. Marchal. NATO Advanced Study Institute Series D, vol. 55 (Kluwer Academic Publishers, Boston, 1990), pp. 241–261
E. Terhardt, Pitch, consonance, and harmony. J. Acoust. Soc. Am. (JASA) 55, 1061–1069 (1974)
H. Traunmueller, Analytical expressions for the tonotopic sensory scale. J. Acoust. Soc. Am. (JASA) 88, 97–100 (1990)
K. Turkowski, S. Gabriel, Filters for common resampling tasks, in Graphics Gems, ed. by A.S. Glassner (Academic Press, New York, 1990), pp. 147–165. ISBN 978-0-12-286165-9
P.-F. Verhulst, Recherches mathématiques sur la loi d’accroissement de la population (mathematical researches into the law of population growth). Nouveaux Mémoires de l’Académie Royale des Sciences et Belles-Lettres de Bruxelles 18, 1–42 (1845)
D. Ververidis, C. Kotropoulos, Emotional speech recognition: resources, features, and methods. Speech Commun. 48(9), 1162–1181 (2006)
A.J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
B. Vlasenko, B. Schuller, A. Wendemuth, G. Rigoll, Frame vs. turn-level: emotion recognition from speech considering static and dynamic processing, in Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction (ACII) 2007, ed. by A. Paiva, R. Prada, R.W. Picard. Lecture Notes in Computer Science, Lisbon, Portugal, vol. 4738 (Springer, Berlin, 2007), pp. 139–147
A.L. Wang, An industrial-strength audio search algorithm, in Proceedings of ISMIR (Baltimore, 2003)
F. Weninger, F. Eyben, B. Schuller, The TUM approach to the MediaEval music emotion task using generic affective audio features, in Proceedings of the MediaEval 2013 Workshop (CEUR-WS.org, Barcelona, 2013)
F. Weninger, F. Eyben, B. Schuller, On-line continuous-time music mood regression with deep recurrent neural networks, in Proceedings of ICASSP 2014 (IEEE, Florence, 2014), pp. 5449–5453
P. Werbos, Backpropagation through time: what it does and how to do it. Proc. IEEE 78, 1550–1560 (1990)
N. Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time Series, M.I.T. Press Paperback Series (Book 9) (MIT Press, Cambridge, 1964), 163 p.
M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, R. Cowie, Abandoning emotion classes—towards continuous emotion recognition with modelling of long-range dependencies, in Proceedings of INTERSPEECH 2008 (ISCA, Brisbane, 2008), pp. 597–600
M. Wöllmer, F. Eyben, A. Graves, B. Schuller, G. Rigoll, Improving keyword spotting with a tandem BLSTM-DBN architecture, in Advances in Non-linear Speech Processing: Revised Selected Papers of the International Conference on Nonlinear Speech Processing (NOLISP) 2009, ed. by J. Sole-Casals, V. Zaiats. Lecture Notes in Computer Science (LNCS), vol. 5933/2010 (Springer, Vic, 2010), pp. 68–75
M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, G. Rigoll, LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis. Comput. (IMAVIS) 31(2), 153–163 (2013). Special Issue on Affect Analysis in Continuous Input
Q. Yan, S. Vaseghi, E. Zavarehei, B. Milner, J. Darch, P. White, I. Andrianakis, Formant-tracking linear prediction model using HMMs and Kalman filters for noisy speech processing. Comput. Speech Lang. 21(3), 543–561 (2007). doi:10.1016/j.csl.2006.11.001
S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, P. Woodland, The HTK Book, Cambridge University Engineering Department, for HTK version 3.4 edition (2006)
E. Yumoto, W.J. Gould, Harmonics-to-noise ratio as an index of the degree of hoarseness. J. Acoust. Soc. Am. (JASA) 71(6), 1544–1549 (1981)
G. Zhou, J.H.L. Hansen, J.F. Kaiser, Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Process. 9(3), 201–216 (2001). doi:10.1109/89.905995
X. Zuo, P. Fung, A cross gender and cross lingual study of stress recognition in speech without linguistic features, in Proceedings of the 17th ICPhS (Hong Kong, China, 2011)
E. Zwicker, Subdivision of the audible frequency range into critical bands. J. Acoust. Soc. Am. (JASA) 33(2), 248–248 (1961)
E. Zwicker, Masking and psychological excitation as consequences of the ear’s frequency analysis, in Frequency Analysis and Periodicity Detection in Hearing, ed. by R. Plomp, G.F. Smoorenburg (Sijthoff, Leyden, 1970)
E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. (JASA) 68, 1523–1525 (1980)
E. Zwicker, H. Fastl, Psychoacoustics—Facts and Models, 2nd edn. (Springer, Berlin, 1999), 417 p. ISBN 978-3540650638
Metadata
Title: Acoustic Features and Modelling
Author: Florian Eyben
Copyright Year: 2016
DOI: https://doi.org/10.1007/978-3-319-27299-3_2