Published in: Multimedia Systems 6/2012

01.11.2012 | Regular Paper

Speech information retrieval: a review

Authors: Ryan P. Hafen, Michael J. Henry



Abstract

Speech is an information-rich component of multimedia. Information can be extracted from a speech signal in a number of different ways, and several well-established speech signal analysis research fields have grown up around these tasks, including speech recognition, speaker recognition, event detection, and fingerprinting. Tools and methods developed in these fields extract information that can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major speech analysis fields. The goal is to provide enough background for someone new to the area to quickly gain a high-level understanding and to offer direction for further study.


Literatur
1.
Zurück zum Zitat Adami, A., Mihaescu, R., Reynolds, D., Godfrey, J.: Modeling prosodic dynamics for speaker recognition. In: Proceedings of the ICASSP, vol. 4, pp. 788–791 (2003) Adami, A., Mihaescu, R., Reynolds, D., Godfrey, J.: Modeling prosodic dynamics for speaker recognition. In: Proceedings of the ICASSP, vol. 4, pp. 788–791 (2003)
2.
Zurück zum Zitat Allamanche, E., Herre, J., Hellmuth, O., Fröba, B., Kastner, T., Cremer, M.: Content-based identification of audio material using MPEG-7 low level description. In: Proceedings of the International Symposium of Music Information Retrieval (2001) Allamanche, E., Herre, J., Hellmuth, O., Fröba, B., Kastner, T., Cremer, M.: Content-based identification of audio material using MPEG-7 low level description. In: Proceedings of the International Symposium of Music Information Retrieval (2001)
3.
Zurück zum Zitat Allegro, S., Buchler, M., Launer, S.: Automatic sound classification inspired by auditory scene analysis. In: Consistent and Reliable Acoustic Cues for Sound Analysis (CRAC), One-Day Workshop, Aalborg, Denmark (2001) Allegro, S., Buchler, M., Launer, S.: Automatic sound classification inspired by auditory scene analysis. In: Consistent and Reliable Acoustic Cues for Sound Analysis (CRAC), One-Day Workshop, Aalborg, Denmark (2001)
4.
Zurück zum Zitat Al-Sawalmeh, W., Daqrouq, K., Daoud, O., Al-Qawasmi, A.: Speaker identification system-based mel frequency and wavelet transform using neural network classifier. Eur. J. Sci. Res. 41(4), 515–525 (2010) Al-Sawalmeh, W., Daqrouq, K., Daoud, O., Al-Qawasmi, A.: Speaker identification system-based mel frequency and wavelet transform using neural network classifier. Eur. J. Sci. Res. 41(4), 515–525 (2010)
5.
Zurück zum Zitat Anguera, X., Wooters, C., Pardo, J.: Robust speaker diarization for meetings: ICSI RT06s evaluation system. In: Ninth International Conference on Spoken Language Processing (2006) Anguera, X., Wooters, C., Pardo, J.: Robust speaker diarization for meetings: ICSI RT06s evaluation system. In: Ninth International Conference on Spoken Language Processing (2006)
6.
Zurück zum Zitat Azmi, M., Tolba, H., Mahdy, S., Fashal, M.: Syllable-based automatic Arabic speech recognition. In: Proceedings of the 7th WSEAS International Conference on Signal Processing, Robotics and Automation, pp. 246–250. World Scientific and Engineering Academy and Society (WSEAS), Greece (2008) Azmi, M., Tolba, H., Mahdy, S., Fashal, M.: Syllable-based automatic Arabic speech recognition. In: Proceedings of the 7th WSEAS International Conference on Signal Processing, Robotics and Automation, pp. 246–250. World Scientific and Engineering Academy and Society (WSEAS), Greece (2008)
7.
Zurück zum Zitat Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C., Morgan, N., O’Shaugnessy, D.: Research developments and directions in speech recognition and understanding. Part 1. IEEE Signal Process. Mag. 26(3), 75–80 (2009)CrossRef Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C., Morgan, N., O’Shaugnessy, D.: Research developments and directions in speech recognition and understanding. Part 1. IEEE Signal Process. Mag. 26(3), 75–80 (2009)CrossRef
8.
Zurück zum Zitat Barbu, T.: A supervised text-independent speaker recognition approach. World Acad. Sci. Eng. Technol. 33 (2007) Barbu, T.: A supervised text-independent speaker recognition approach. World Acad. Sci. Eng. Technol. 33 (2007)
9.
Zurück zum Zitat Barras, C., Zhu, X., Meignier, S., Gauvain, J.: Improving speaker diarization. In: RT-04F Workshop (2004) Barras, C., Zhu, X., Meignier, S., Gauvain, J.: Improving speaker diarization. In: RT-04F Workshop (2004)
10.
Zurück zum Zitat Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4) (2002) Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88(4) (2002)
11.
Zurück zum Zitat Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., Fissore, L., Laface, P., Mertins, A., Ris, C., Rose, R., Tyagi, V., Wellekens, C.: Automatic speech recognition and speech variability: a review. Speech Commun. 49(10–11), 763–786 (2007)CrossRef Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., Fissore, L., Laface, P., Mertins, A., Ris, C., Rose, R., Tyagi, V., Wellekens, C.: Automatic speech recognition and speech variability: a review. Speech Commun. 49(10–11), 763–786 (2007)CrossRef
12.
Zurück zum Zitat Bimbot, F., Bonastre, J.F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-Garcıa, J.: A tutorial on text-independent speaker verification. EURASIP J. Appl. Signal Process. 4, 430–451 (2004) Bimbot, F., Bonastre, J.F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-Garcıa, J.: A tutorial on text-independent speaker verification. EURASIP J. Appl. Signal Process. 4, 430–451 (2004)
13.
Zurück zum Zitat Bonastre, J., Wils, F., Meignier, S.: ALIZE, a free toolkit for speaker recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), Philadelphia, USA, pp. 737–740 (2005) Bonastre, J., Wils, F., Meignier, S.: ALIZE, a free toolkit for speaker recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), Philadelphia, USA, pp. 737–740 (2005)
14.
Zurück zum Zitat Bonastre, J., Scheffer, N., Matrouf, D., Fredouille, C., Larcher, A., Preti, A., Pouchoulin, G., Evans, N., Fauve, B., Mason, J.: ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In: Odyssey-The Speaker and Language Recognition Workshop (2008) Bonastre, J., Scheffer, N., Matrouf, D., Fredouille, C., Larcher, A., Preti, A., Pouchoulin, G., Evans, N., Fauve, B., Mason, J.: ALIZE/SpkDet: a state-of-the-art open source software for speaker recognition. In: Odyssey-The Speaker and Language Recognition Workshop (2008)
15.
Zurück zum Zitat Brill, E.: Discovering the lexical features of a language. In: Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, pp. 339–340. Association for Computational Linguistics (1991) Brill, E.: Discovering the lexical features of a language. In: Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, pp. 339–340. Association for Computational Linguistics (1991)
16.
Zurück zum Zitat Brümmer, N., du Preez, J.: Application-independent evaluation of speaker detection. Comput. Speech Lang. 20(2–3), 230–275 (2006)CrossRef Brümmer, N., du Preez, J.: Application-independent evaluation of speaker detection. Comput. Speech Lang. 20(2–3), 230–275 (2006)CrossRef
17.
Zurück zum Zitat Burges, C., Platt, J., Jana, S.: Distortion discriminant analysis for audio fingerprinting. IEEE Trans. Speech Audio Process. 11(3), 165–174 (2003)CrossRef Burges, C., Platt, J., Jana, S.: Distortion discriminant analysis for audio fingerprinting. IEEE Trans. Speech Audio Process. 11(3), 165–174 (2003)CrossRef
18.
Zurück zum Zitat Camastra, F., Vinciarelli, A., Yu, J.: Machine learning for audio, image and video analysis. J. Electron. Imaging 18, 029901 (2009)CrossRef Camastra, F., Vinciarelli, A., Yu, J.: Machine learning for audio, image and video analysis. J. Electron. Imaging 18, 029901 (2009)CrossRef
19.
Zurück zum Zitat Campbell, J., Reynolds, D., Dunn, R.: Fusing high-and low-level features for speaker recognition. In: Eighth European Conference on Speech Communication and Technology (2003) Campbell, J., Reynolds, D., Dunn, R.: Fusing high-and low-level features for speaker recognition. In: Eighth European Conference on Speech Communication and Technology (2003)
20.
Zurück zum Zitat Campbell, W., Sturim, D., Reynolds, D.: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5) (2006) Campbell, W., Sturim, D., Reynolds, D.: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5) (2006)
21.
Zurück zum Zitat Campbell, J.P., Shen, W., Campbell, W.M., Schwartz, R., Bonastre, J.F., Matrouf, D.: Forensic speaker recognition. Signal Process. Mag. IEEE 26(2009), 95–103 (2009)CrossRef Campbell, J.P., Shen, W., Campbell, W.M., Schwartz, R., Bonastre, J.F., Matrouf, D.: Forensic speaker recognition. Signal Process. Mag. IEEE 26(2009), 95–103 (2009)CrossRef
22.
Zurück zum Zitat Cano, P., Batlle, E., Kalker, T., Haitsma, J.: A review of audio fingerprinting. J. VLSI Signal Process. 41(3), 271–284 (2005)CrossRef Cano, P., Batlle, E., Kalker, T., Haitsma, J.: A review of audio fingerprinting. J. VLSI Signal Process. 41(3), 271–284 (2005)CrossRef
23.
Zurück zum Zitat Canseco-Rodriguez, L., Lamel, L., Gauvain, J.: Speaker diarization from speech transcripts. In: Proceedings of the ICSLP, vol. 4 (2004) Canseco-Rodriguez, L., Lamel, L., Gauvain, J.: Speaker diarization from speech transcripts. In: Proceedings of the ICSLP, vol. 4 (2004)
24.
Zurück zum Zitat Casey, M.: General sound classification and similarity in MPEG-7. Organ. Sound 6(2), 153–164 (2002) Casey, M.: General sound classification and similarity in MPEG-7. Organ. Sound 6(2), 153–164 (2002)
25.
Zurück zum Zitat Cohen, L.: Time frequency distributions—a review. In: Proceedings of the IEEE, vol. 77 (1989) Cohen, L.: Time frequency distributions—a review. In: Proceedings of the IEEE, vol. 77 (1989)
26.
Zurück zum Zitat de Jong, F., Gauvain, J.L., Hiemstra, D., Netter, K.: Language-based multimedia information retrieval. In: In 6th RIAO Conference (2000) de Jong, F., Gauvain, J.L., Hiemstra, D., Netter, K.: Language-based multimedia information retrieval. In: In 6th RIAO Conference (2000)
27.
Zurück zum Zitat Dunning, T.: Statistical identification of language. Tech. Rep. MCCS 94-273, New Mexico State University (1994) Dunning, T.: Statistical identification of language. Tech. Rep. MCCS 94-273, New Mexico State University (1994)
28.
Zurück zum Zitat Dusan, S., Deng, L.: Estimation of articulatory parameters from speech acoustics by Kalman filtering. In: Proceedings of CITO Researcher Retreat-Hamilton (1998) Dusan, S., Deng, L.: Estimation of articulatory parameters from speech acoustics by Kalman filtering. In: Proceedings of CITO Researcher Retreat-Hamilton (1998)
30.
Zurück zum Zitat Fauve, B.G.B., Matrouf, D., Scheffer, N., Bonastre, J.F.F., Mason, J.S.D.: State-of-the-art performance in text-independent speaker verification through open-source software. IEEE Trans. Audio Speech Lang. Process. 15(7), 1960–1968 (2007)CrossRef Fauve, B.G.B., Matrouf, D., Scheffer, N., Bonastre, J.F.F., Mason, J.S.D.: State-of-the-art performance in text-independent speaker verification through open-source software. IEEE Trans. Audio Speech Lang. Process. 15(7), 1960–1968 (2007)CrossRef
31.
Zurück zum Zitat Ferrer, L., Scheffer, N., Shriberg, E.: A comparison of approaches for modeling prosodic features in speaker recognition. In: IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 4414–4417. IEEE, New York (2010) Ferrer, L., Scheffer, N., Shriberg, E.: A comparison of approaches for modeling prosodic features in speaker recognition. In: IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pp. 4414–4417. IEEE, New York (2010)
32.
Zurück zum Zitat Friedland, A., Vinyals, B., Huang, C., Muller, D.: Fusing short term and long term features for improved speaker diarization. In: Acoustics, Speech and Signal Processing, ICASSP 2009, pp. 4077–4080. IEEE (2009) Friedland, A., Vinyals, B., Huang, C., Muller, D.: Fusing short term and long term features for improved speaker diarization. In: Acoustics, Speech and Signal Processing, ICASSP 2009, pp. 4077–4080. IEEE (2009)
33.
Zurück zum Zitat Fulop, S., Disner, S.: The reassigned spectrogram as a tool for voice identification. In: Proceedings of ICPhS 2007, pp. 1853–1856 (2007) Fulop, S., Disner, S.: The reassigned spectrogram as a tool for voice identification. In: Proceedings of ICPhS 2007, pp. 1853–1856 (2007)
34.
Zurück zum Zitat Fulop, S., Disner, S.: Advanced time-frequency displays applied to forensic speaker identification. Proc. Meet. Acoust. 6, 060008 (2009) Fulop, S., Disner, S.: Advanced time-frequency displays applied to forensic speaker identification. Proc. Meet. Acoust. 6, 060008 (2009)
35.
Zurück zum Zitat Gang, C., Hui, T., Xin-meng, C.: Audio segmentation via the similarity measure of audio feature vectors. Wuhan Univ. J. Nat. Sci. 10(5), 833–837 (2005)CrossRef Gang, C., Hui, T., Xin-meng, C.: Audio segmentation via the similarity measure of audio feature vectors. Wuhan Univ. J. Nat. Sci. 10(5), 833–837 (2005)CrossRef
36.
Zurück zum Zitat Gannert, T.: A Speaker Verification System Under the Scope: Alize. Master’s thesis, TMH (2007) Gannert, T.: A Speaker Verification System Under the Scope: Alize. Master’s thesis, TMH (2007)
37.
Zurück zum Zitat Gravier, G., Betser, M., Ben, M.: Audio Segmentation Toolkit, release 1.2. IRISA (2010) Gravier, G., Betser, M., Ben, M.: Audio Segmentation Toolkit, release 1.2. IRISA (2010)
38.
Zurück zum Zitat Haitsma, J., Kalker, T.: A highly robust audio fingerprinting system with an efficient search strategy. J. New Music Res. 32(2), 211–221 (2003)CrossRef Haitsma, J., Kalker, T.: A highly robust audio fingerprinting system with an efficient search strategy. J. New Music Res. 32(2), 211–221 (2003)CrossRef
39.
Zurück zum Zitat Haitsma, J., Kalker, T., Oostveen, J.: Robust audio hashing for content identification. In: Proceedings of the Content-Based Multimedia Indexing (2001) Haitsma, J., Kalker, T., Oostveen, J.: Robust audio hashing for content identification. In: Proceedings of the Content-Based Multimedia Indexing (2001)
40.
Zurück zum Zitat Hansen, J., Bou-Ghazale, S., Sarikaya, R., Pellom, B.: Getting started with the SUSAS: speech under simulated and actual stress database. In: Robust Speech Processing Laboratory (1998) Hansen, J., Bou-Ghazale, S., Sarikaya, R., Pellom, B.: Getting started with the SUSAS: speech under simulated and actual stress database. In: Robust Speech Processing Laboratory (1998)
41.
Zurück zum Zitat Hansen, J.H., Gavidia-Ceballos, L., Kaiser, J.F.: A nonlinear operator-based speech feature analysis method with application to vocal fold pathology assessment. In: IEEE Transactions on Biomedical Engineering (1998) Hansen, J.H., Gavidia-Ceballos, L., Kaiser, J.F.: A nonlinear operator-based speech feature analysis method with application to vocal fold pathology assessment. In: IEEE Transactions on Biomedical Engineering (1998)
42.
Zurück zum Zitat Harris, F.: On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66(1), 51–83 (1978)CrossRef Harris, F.: On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66(1), 51–83 (1978)CrossRef
43.
Zurück zum Zitat Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)CrossRef Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)CrossRef
44.
Zurück zum Zitat Hermansky, H., Morgan, N.: RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4), 578–589 (1994)CrossRef Hermansky, H., Morgan, N.: RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4), 578–589 (1994)CrossRef
46.
Zurück zum Zitat Huijbregts, M.: Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled. PrintPartners Ipskamp, Enschede (2008) Huijbregts, M.: Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled. PrintPartners Ipskamp, Enschede (2008)
48.
Zurück zum Zitat Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1997) Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press, Cambridge (1997)
49.
Zurück zum Zitat Jiang, D.N., Cai, L.H.: Speech emotion classification with the combination of statistic features and temporal features. In: IEEE International Conference on Multimedia and Expo (2004) Jiang, D.N., Cai, L.H.: Speech emotion classification with the combination of statistic features and temporal features. In: IEEE International Conference on Multimedia and Expo (2004)
50.
Zurück zum Zitat Jin, Q.: Robust Speaker Recognition. Ph.D. thesis, Carnegie Mellon University (2007) Jin, Q.: Robust Speaker Recognition. Ph.D. thesis, Carnegie Mellon University (2007)
51.
Zurück zum Zitat Kajarekar, S., Ferrer, L., Stolcke, A., Shriberg, E.: Voice-based speaker recognition combining acoustic and stylistic features. In: Advances in Biometrics, pp. 183–201 (2008) Kajarekar, S., Ferrer, L., Stolcke, A., Shriberg, E.: Voice-based speaker recognition combining acoustic and stylistic features. In: Advances in Biometrics, pp. 183–201 (2008)
52.
Zurück zum Zitat Kenny, P., Boulianne, G., Dumouchel, P.: Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13(3), 345–354 (2005)CrossRef Kenny, P., Boulianne, G., Dumouchel, P.: Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13(3), 345–354 (2005)CrossRef
53.
Zurück zum Zitat Kimura, A., Kashino, K., Kurozumi, T., Murase, H.: Very quick audio searching: introducing global pruning to the time-series active search. In: IEEE International Conference on Acoustics Speech and Signal Processing, vol. 3 (2001) Kimura, A., Kashino, K., Kurozumi, T., Murase, H.: Very quick audio searching: introducing global pruning to the time-series active search. In: IEEE International Conference on Acoustics Speech and Signal Processing, vol. 3 (2001)
54.
Zurück zum Zitat Kinnunen, T., Li, H.: An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52(1), 12–40 (2010)CrossRef Kinnunen, T., Li, H.: An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52(1), 12–40 (2010)CrossRef
55.
Zurück zum Zitat Kinnunen, T.: Spectral features for automatic text-independent speaker recognition. Ph. Lic. thesis, University of Joensuu, Department of Computer Science (2004) Kinnunen, T.: Spectral features for automatic text-independent speaker recognition. Ph. Lic. thesis, University of Joensuu, Department of Computer Science (2004)
56.
Zurück zum Zitat Larcher, A., Lévy, C., Matrouf, D., Bonastre, J.: LIA NIST-SRE’10 systems. Unpublished (2010) Larcher, A., Lévy, C., Matrouf, D., Bonastre, J.: LIA NIST-SRE’10 systems. Unpublished (2010)
58.
Zurück zum Zitat Lee, A., Kawahara, T., Takeda, K., Mimura, M., Yamada, A., Ito, A., Itou, K., Shikano, K.: Continuous speech recognition consortium—an open repository for CSR tools and models. In: Proceedings of the IEEE International Conference on Language Resources and Evaluation (2002) Lee, A., Kawahara, T., Takeda, K., Mimura, M., Yamada, A., Ito, A., Itou, K., Shikano, K.: Continuous speech recognition consortium—an open repository for CSR tools and models. In: Proceedings of the IEEE International Conference on Language Resources and Evaluation (2002)
59.
Zurück zum Zitat Lee, C.H.: Back to speech science-towards a collaborative ASR community of the 21st century. In: Dynamics of Speech Production and Perception, p. 221 (2006) Lee, C.H.: Back to speech science-towards a collaborative ASR community of the 21st century. In: Dynamics of Speech Production and Perception, p. 221 (2006)
60.
Zurück zum Zitat Li, S.: Content-based audio classification and retrieval using the nearest feature line method. IEEE Trans. Speech Audio Process. 8(5), 619–625 (2002)CrossRef Li, S.: Content-based audio classification and retrieval using the nearest feature line method. IEEE Trans. Speech Audio Process. 8(5), 619–625 (2002)CrossRef
61.
Zurück zum Zitat Li, D., Sethi, I., Dimitrova, N., McGee, T.: Classification of general audio data for content-based retrieval. Pattern Recognit. Lett. 22(5), 533–544 (2001)MATHCrossRef Li, D., Sethi, I., Dimitrova, N., McGee, T.: Classification of general audio data for content-based retrieval. Pattern Recognit. Lett. 22(5), 533–544 (2001)MATHCrossRef
62.
Zurück zum Zitat Li, X., Tao, J., Johnson, M.T., Soltis, J., Savage, A., Leong, K.M., Newman, J.D.: Stress and emotion classification using Jitter and Shimmer features. In: IEEE International Conference on Acoustics Speech and Signal Processing (2007) Li, X., Tao, J., Johnson, M.T., Soltis, J., Savage, A., Leong, K.M., Newman, J.D.: Stress and emotion classification using Jitter and Shimmer features. In: IEEE International Conference on Acoustics Speech and Signal Processing (2007)
63.
Zurück zum Zitat Li, H., Ma, B., Lee, C.: A vector space modeling approach to spoken language identification. IEEE Trans. Audio Speech Lang. Process. 15(1), 271–284 (2007)CrossRef Li, H., Ma, B., Lee, C.: A vector space modeling approach to spoken language identification. IEEE Trans. Audio Speech Lang. Process. 15(1), 271–284 (2007)CrossRef
65.
Zurück zum Zitat Liscombe, J., Riccardi, G., Hakkaini-Tür, D.: Using context to improve emotion detection in spoken dialog systems. In: Proceedings of Interspeech (2005) Liscombe, J., Riccardi, G., Hakkaini-Tür, D.: Using context to improve emotion detection in spoken dialog systems. In: Proceedings of Interspeech (2005)
66.
Zurück zum Zitat Low, L.S.A., Maddage, N.C., Lech, M., Sheeber, L.B., Allen, N.B.: Detection of clinical depression n adolescents’ speech during family interactions. In: IEEE Transactions on Biomedical Engineering (2011) Low, L.S.A., Maddage, N.C., Lech, M., Sheeber, L.B., Allen, N.B.: Detection of clinical depression n adolescents’ speech during family interactions. In: IEEE Transactions on Biomedical Engineering (2011)
67.
Zurück zum Zitat Lu, L., Zhang, H., Li, S.: Content-based audio classification and segmentation by using support vector machines. Multimed. Syst. 8(6), 482–492 (2003)CrossRef Lu, L., Zhang, H., Li, S.: Content-based audio classification and segmentation by using support vector machines. Multimed. Syst. 8(6), 482–492 (2003)CrossRef
68.
Zurück zum Zitat Lu, H., Pan, W., Lane, N., Choudhury, T., Campbell, A.: SoundSense: scalable sound sensing for people-centric applications on mobile phones. In: Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services, pp. 165–178. ACM, New York (2009) Lu, H., Pan, W., Lane, N., Choudhury, T., Campbell, A.: SoundSense: scalable sound sensing for people-centric applications on mobile phones. In: Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services, pp. 165–178. ACM, New York (2009)
69.
Zurück zum Zitat Ma, G., Zhou, W., Zheng, J., You, X., Ye, W.: A comparison between HTK and SPHINX on Chinese Mandarin. In: Proceedings of the 2009 International Joint Conference on Artificial Intelligence, pp. 394–397. IEEE Computer Society, New York (2009) Ma, G., Zhou, W., Zheng, J., You, X., Ye, W.: A comparison between HTK and SPHINX on Chinese Mandarin. In: Proceedings of the 2009 International Joint Conference on Artificial Intelligence, pp. 394–397. IEEE Computer Society, New York (2009)
70.
Zurück zum Zitat Makhoul, J.: Information extraction from speech. In: Spoken Language Technology Workshop, 2006, p. 3. IEEE, New York (2007) Makhoul, J.: Information extraction from speech. In: Spoken Language Technology Workshop, 2006, p. 3. IEEE, New York (2007)
71.
Zurück zum Zitat Meignier, S., Moraru, D., Fredouille, C., Bonastre, J., Besacier, L.: Step-by-step and integrated approaches in broadcast news speaker diarization. Comput. Speech Lang. 20(2–3), 303–330 (2006)CrossRef Meignier, S., Moraru, D., Fredouille, C., Bonastre, J., Besacier, L.: Step-by-step and integrated approaches in broadcast news speaker diarization. Comput. Speech Lang. 20(2–3), 303–330 (2006)CrossRef
72.
Zurück zum Zitat Meignier, S., Merlin, T.: Lium SpkDiarization: an open source toolkit for diarization. In: CMU SPUD Workshop (2010) Meignier, S., Merlin, T.: Lium SpkDiarization: an open source toolkit for diarization. In: CMU SPUD Workshop (2010)
73.
Zurück zum Zitat Meinedo, H., Neto, J.: Audio segmentation, classification and clustering in a broadcast news task. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 2. IEEE, New York (2003) Meinedo, H., Neto, J.: Audio segmentation, classification and clustering in a broadcast news task. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 2. IEEE, New York (2003)
74.
Zurück zum Zitat Milner, B., Shao, X.: Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model. In: Seventh International Conference on Spoken Language Processing (2002) Milner, B., Shao, X.: Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model. In: Seventh International Conference on Spoken Language Processing (2002)
75.
Zurück zum Zitat Miotto, R., Orio, N.: Automatic identification of music works through audio matching. In: ECDL (2007) Miotto, R., Orio, N.: Automatic identification of music works through audio matching. In: ECDL (2007)
76.
Zurück zum Zitat Moore, E. II, Clements, M.A., Peifer, J.W., Weisser, L.: Critical analysis of the impact of glottal features in the classification of clinical depression in speech. In: IEEE Transactions on Biomedical Engineering (2008) Moore, E. II, Clements, M.A., Peifer, J.W., Weisser, L.: Critical analysis of the impact of glottal features in the classification of clinical depression in speech. In: IEEE Transactions on Biomedical Engineering (2008)
81.
Zurück zum Zitat Nwe, T.L., Wei, F.S., Silva, L.D.: Speech based emotion classification. In: Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology (2001) Nwe, T.L., Wei, F.S., Silva, L.D.: Speech based emotion classification. In: Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology (2001)
83.
Zurück zum Zitat O’Shaughnessy, D.: Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91(9), 1272–1305 (2003)CrossRef O’Shaughnessy, D.: Interacting with computers by voice: automatic speech recognition and synthesis. Proc. IEEE 91(9), 1272–1305 (2003)CrossRef
84.
Zurück zum Zitat Padgett, C., Cottrell, G.: Representing face images for emotion classification. In: Advances in Neural Information Processing Systems (1997) Padgett, C., Cottrell, G.: Representing face images for emotion classification. In: Advances in Neural Information Processing Systems (1997)
85.
Zurück zum Zitat Pallett, D.: A look at NIST’s benchmark ASR tests: past, present, and future. In: Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (2003) Pallett, D.: A look at NIST’s benchmark ASR tests: past, present, and future. In: Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (2003)
86.
Zurück zum Zitat Papaodysseus, C., Roussopoulos, G., Fragoulis, D., Panagopoulos, T., Alexiou, C.: A new approach to the automatic recognition of musical recordings. J. Audio Eng. Soc. 49(1/2), 23–35 (2001) Papaodysseus, C., Roussopoulos, G., Fragoulis, D., Panagopoulos, T., Alexiou, C.: A new approach to the automatic recognition of musical recordings. J. Audio Eng. Soc. 49(1/2), 23–35 (2001)
87.
Zurück zum Zitat Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. In: A Speaker Odyssey-The Speaker Recognition Workshop (2001) Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. In: A Speaker Odyssey-The Speaker Recognition Workshop (2001)
88.
Zurück zum Zitat Petrovska-Delacrétaz, D., El Hannani, A., Chollet, G.: Text-independent speaker verification: state of the art and challenges. In: Progress in Nonlinear Speech Processing, pp. 135–169 (2007) Petrovska-Delacrétaz, D., El Hannani, A., Chollet, G.: Text-independent speaker verification: state of the art and challenges. In: Progress in Nonlinear Speech Processing, pp. 135–169 (2007)
89.
Zurück zum Zitat Poutsma, A.: Applying Monte Carlo techniques to language identification. In: Proceedings of Computational Linguistics in the Netherlands (CLIN) (2001) Poutsma, A.: Applying Monte Carlo techniques to language identification. In: Proceedings of Computational Linguistics in the Netherlands (CLIN) (2001)
90.
Zurück zum Zitat R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2010). http://www.R-project.or. ISBN 3-900051-07-0 R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2010). http://​www.​R-project.​or. ISBN 3-900051-07-0
91.
Zurück zum Zitat Ramachandran, R., Farrell, K., Ramachandran, R., Mammone, R.: Speaker recognition—general classifier approaches and data fusion methods. Pattern Recognit. 35(12), 2801–2821 (2002)MATHCrossRef Ramachandran, R., Farrell, K., Ramachandran, R., Mammone, R.: Speaker recognition—general classifier approaches and data fusion methods. Pattern Recognit. 35(12), 2801–2821 (2002)MATHCrossRef
92.
Zurück zum Zitat Ravindran, S., Anderson, D., Slaney, M.: Low-power audio classification for ubiquitous sensor networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2004) Ravindran, S., Anderson, D., Slaney, M.: Low-power audio classification for ubiquitous sensor networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2004)
94. Rehurek, R., Kolkus, M.: Language identification on the web: extending the dictionary method. Lect. Notes Comput. Sci. 5449, 357–368 (2009)
95. Reynolds, D.: An overview of automatic speaker recognition technology. IEEE Int. Conf. Acoust. Speech Signal Process. 4, 4072–4075 (2002)
96. Reynolds, D.: Channel robust speaker verification via feature mapping. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), vol. 2 (2003)
97. Reynolds, D., Campbell, J., Campbell, W., Dunn, R., Gleason, T., Jones, D., Quatieri, T., Quillen, C., Sturim, D., Torres-Carrasquillo, P.: Beyond cepstra: exploiting high-level information in speaker recognition. In: Proceedings of the Workshop on Multimodal User Authentication, pp. 223–229 (2003)
98. Reynolds, D., Torres-Carrasquillo, P.: The MIT Lincoln laboratory RT-04F diarization systems: applications to broadcast audio and telephone conversations. In: RT-04F Workshop (2004)
99. Reynolds, D., Torres-Carrasquillo, P.: Approaches and applications of audio diarization. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05) (2005)
100.
101. Satori, H., Hiyassat, H., Harti, M., Chenfour, N.: Investigation Arabic speech recognition using CMU Sphinx system. Int. Arab J. Inf. Technol. 6(2) (2009)
102. Schuller, B., Batliner, A., Seppi, D., Steidl, S., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Amir, N., Kessous, L., Aharonson, V.: The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals. In: INTERSPEECH (2007)
103. Sinha, R., Tranter, S., Gales, M., Woodland, P.: The Cambridge University March 2005 speaker diarisation system. In: Ninth European Conference on Speech Communication and Technology (2005)
104. Sonmez, M., Heck, L., Weintraub, M., Shriberg, E.: A lognormal tied mixture model of pitch for prosody-based speaker recognition. In: Proceedings of the Eurospeech, vol. 3, pp. 1391–1394 (1997)
105. Sonmez, K., Shriberg, E., Heck, L., Weintraub, M.: Modeling dynamic prosodic variation for speaker verification. In: Fifth International Conference on Spoken Language Processing (1998)
107. Stallard, D., Prasad, R., Natarajan, P.: Development and internal evaluation of speech-to-speech translation technology at BBN. In: PerMIS '09: Proceedings of the 9th Workshop on Performance Metrics for Intelligent Systems, pp. 231–237. ACM, New York (2009). doi:10.1145/1865909.1865956
108. Stevens, S., Volkmann, J., Newman, E.: A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 8, 185 (1937)
110. Sukittanon, S., Atlas, L.: Modulation frequency features for audio fingerprinting. In: IEEE International Conference on Acoustics Speech and Signal Processing, vol. 2 (2002)
112. Teager, H.: Some observations on oral air flow during phonation. In: IEEE Transactions on Acoustics, Speech and Signal Processing (1980)
113. Tokuhisa, R., Inui, K., Matsumoto, Y.: Emotion classification using massive examples extracted from the web. In: Proceedings of the 22nd International Conference on Computational Linguistics (2008)
114. Tong, R., Ma, B., Zhu, D., Li, H., Chng, E.S.: Integrating acoustic, prosodic and phonotactic features for spoken language identification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2006)
115. Tranter, S., Reynolds, D.: An overview of automatic speaker diarization systems. IEEE Trans. Audio Speech Lang. Process. 14(5), 1557–1565 (2006)
116. Tzanetakis, G., Cook, F.: A framework for audio analysis based on classification and temporal segmentation. In: Proceedings of the 25th EUROMICRO Conference, 1999, vol. 2, pp. 61–67. IEEE, New York (2002)
118. Vertanen, K.: Baseline WSJ acoustic models for HTK and Sphinx: training recipes and recognition experiments. Tech. rep., Cavendish Laboratory, University of Cambridge (2006)
120. Walker, W., Lamere, P., Kwok, P., Raj, B., Singh, R., Gouvea, E., Wolf, P., Woelfel, J.: Sphinx-4: A Flexible Open Source Framework for Speech Recognition, p. 18. Sun Microsystems, Inc., Mountain View (2004)
121. Wang, A.: An industrial strength audio search algorithm. In: International Conference on Music Information Retrieval (ISMIR) (2003)
122. Wassner, H., Chollet, G.: New cepstral representation using wavelet analysis and spectral transformation for robust speech recognition. In: Proceedings of ICSLP, vol. 96 (1996)
123. Woodland, P., Odell, J., Valtchev, V., Young, S.: Large vocabulary continuous speech recognition using HTK. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. ICASSP-94, vol. 2 (1994)
124. Wooters, C., Huijbregts, M.: The ICSI RT07s speaker diarization system. In: Multimodal Technologies for Perception of Humans, pp. 509–519 (2009)
125. Xu, M., Duan, L., Cai, J., Chia, L., Xu, C., Tian, Q.: HMM-based audio keyword generation. In: Advances in Multimedia Information Processing-PCM 2004, pp. 566–574 (2005)
126. Yang, C., Lin, K.H.Y., Chen, H.H.: Emotion classification using web blog corpora. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (2007)
127. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., Woodland, P.: The HTK Book. Cambridge University Engineering Department, Cambridge (2002)
128. Zhang, T., Kuo, C.: Hierarchical classification of audio data for archiving and retrieving. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, vol. 6, pp. 3001–3004 (1999)
129. Zhang, J., Whalley, J., Brooks, S.: A two phase method for general audio segmentation. In: IEEE International Conference on Multimedia and Expo, ICME 2009, pp. 626–629. IEEE (2009)
130. Zhang, X.: Audio Segmentation, Classification and Visualization. Ph.D. thesis, Auckland University of Technology (2009)
131. Zhu, X., Barras, C., Meignier, S., Gauvain, J.: Combining speaker identification and BIC for speaker diarization. In: Ninth European Conference on Speech Communication and Technology (2005)
132. Zhu, X., Barras, C., Lamel, L., Gauvain, J.: Speaker diarization: from broadcast news to lectures. In: Machine Learning for Multimodal Interaction, pp. 396–406 (2006)
133. Zwicker, E.: Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Acoust. Soc. Am. J. 33, 248 (1961)
Metadata
Title
Speech information retrieval: a review
Authors
Ryan P. Hafen
Michael J. Henry
Publication date
01.11.2012
Publisher
Springer-Verlag
Published in
Multimedia Systems / Issue 6/2012
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-012-0266-0
