International Journal of Speech Technology

01-06-2014

Recent developments in spoken term detection: a survey

Authors: Anupam Mandal, K. R. Prasanna Kumar, Pabitra Mitra

Published in: International Journal of Speech Technology | Issue 2/2014

Abstract

Spoken term detection (STD) provides an efficient means for content-based indexing of speech. However, achieving high detection performance and faster speed, detecting out-of-vocabulary (OOV) words, and performing STD on low-resource languages remain major research challenges. This paper provides a comprehensive survey of the important approaches to STD and of how they address these challenges. The review classifies these approaches, highlights their advantages and limitations, and discusses their context of usage. It also analyzes the approaches in terms of detection accuracy, storage requirements, and execution time. The paper summarizes the tools and speech corpora used in the different approaches. Finally, it concludes with future research directions in this area.

Title: Recent developments in spoken term detection: a survey
Authors: Anupam Mandal, K. R. Prasanna Kumar, Pabitra Mitra
Publication date: 01-06-2014
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 2/2014
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-013-9217-1
