Sparse coding over redundant dictionaries for fast adaptation of speech recognition system

https://doi.org/10.1016/j.csl.2016.10.004

Highlights

  • Sparse coding is used in a novel way for the on-line adaptation of HMM-based ASR systems.

  • The target is first sparse-coded using OMP over exemplar/learned speaker dictionaries.

  • The adapted model is obtained by maximum-likelihood scaling of the sparse-coded target.

  • Performance comparable to that of the existing techniques is obtained, but at a much lower computational complexity.

Abstract

This work presents a novel use of sparse coding over a redundant dictionary for fast adaptation of the acoustic models in hidden Markov model (HMM) based automatic speech recognition (ASR) systems. The presented work extends the existing acoustic-model-interpolation-based fast adaptation approaches. In these methods, the basis (model) weights are estimated using an iterative procedure employing the maximum-likelihood (ML) criterion. For effective adaptation, typically a number of bases are selected, and as a result the latency of the iterative weight estimation becomes high for ASR tasks that involve human-machine interaction. To address this issue, we propose sparse coding of the target mean supervector over a speaker-specific (exemplar) redundant dictionary. In this approach, the employed greedy sparse coding not only selects the desired bases but also compresses them into a single supervector, which is then ML-scaled to yield the adapted mean parameters. This reduces the latency of the basis-weight estimation in comparison to the existing fast adaptation techniques. Further, to address the loss of information due to the reduced degrees of freedom, we extend the proposed approach using separate sparse codings over multiple (exemplar and learned) redundant dictionaries. In adapting an ASR task involving human-computer interaction, the proposed approach is found to be as effective as the existing techniques, but at a substantially reduced computational cost.

Introduction

Automatic speech recognition (ASR) systems have traditionally been developed using Gaussian mixture model (GMM) based context-dependent hidden Markov models (CD-HMMs). Recently, the deep neural network (DNN) (Hinton et al., 2012) has been proposed to overcome the inefficiency of the GMM in modeling events that lie on or near a nonlinear manifold in the data. With the advent of the DNN for generating the observation probabilities, ASR systems based on DNN–HMM are fast becoming popular. In general, ASR systems are trained on speech data from a large number of (male and female) speakers. This pooled training is primarily intended to make the statistical models speaker independent (SI). Unlike a speaker-dependent (SD) system, an SI system has to deal with both intra-speaker and inter-speaker variability. As a result, an SI system is reported to be two to three times inferior to an SD system when both are trained with an equal amount of data (Woodland, 2001). SD systems, though quite effective, are infeasible to build for every speaker in the user population because they require a large amount of speech data per speaker. Consequently, speaker adaptation techniques have been developed that modify the parameters of the SI system to better suit a particular speaker, given a limited amount of data from that speaker.

Applications of ASR include information retrieval, language-learning tools, voice-based search, and entertainment (Eskenazi, 2009; Gray et al., 2014; Schalkwyk et al., 2010; Wang et al., 2008). In ASR tasks involving human-machine interaction, the system is required to recognize speech from adult as well as child (male/female) speakers. Further, in such tasks the adaptation data becomes available to the ASR system incrementally, and its duration is generally very small. Consequently, conventional adaptation techniques such as maximum a-posteriori (MAP) adaptation (Gauvain and Lee, 1994) and maximum likelihood linear regression (MLLR) (Digalakis et al., 1995; Leggetter and Woodland, 1995) are found to be unsuitable, because the available adaptation data is insufficient to estimate their large number of parameters. Fast adaptation techniques, on the other hand, modify the model parameters even with a small amount of adaptation data (Gales, 1999; Hazen and Glass, 1997; Kenny et al., 2005; Kuhn et al., 2000). In these techniques, the adapted model parameters are derived by linear interpolation of a set of bases (predefined acoustic models) spanning a low (K)-dimensional subspace, so only a few interpolation weights (or direction coordinates) need to be estimated, as made explicit below. This reduction in complexity makes these approaches amenable to the aforementioned adaptation tasks. Fast adaptation approaches have been found effective for adapting the Gaussian means as well as the mixture weights (Duchateau et al., 2008; Hahm et al., 2010) of the acoustic model.
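
To make the interpolation explicit (the notation here is ours, for illustration), the adapted mean supervector can be written as

    \hat{\boldsymbol{\mu}} \;=\; \sum_{k=1}^{K} w_k \, \boldsymbol{\mu}^{(k)},
    \qquad
    \hat{\mathbf{w}} \;=\; \arg\max_{\mathbf{w}} \, \mathcal{L}\!\left(\mathcal{O} \,\middle|\, \lambda(\mathbf{w})\right),

where \boldsymbol{\mu}^{(k)} is the mean supervector of the k-th selected basis model, \mathcal{O} denotes the adaptation data, and \lambda(\mathbf{w}) is the acoustic model obtained with interpolation weights \mathbf{w}. The maximization is performed iteratively, which is the latency bottleneck the proposed method targets.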

In the initial works on fast adaptation (Gales, 1999; Hazen and Glass, 1997; Kenny et al., 2005; Kuhn et al., 2000), the K bases to be interpolated were kept fixed. Recent works (Mak et al., 2006; Teng et al., 2009) have shown that improved recognition performance can be achieved for the given test data by selecting the K bases from a predefined set of acoustic models. In those approaches, speaker-adapted (SA) models corresponding to each of the N speakers in the training set are derived first. This is usually done by adapting the mean parameters of the SI model with the speaker-specific data while keeping the covariance matrices and mixture weights unchanged. Given the test data, a set of K models (bases) is then selected from the N acoustic models and linearly combined; the K interpolation weights required for the linear combination are estimated using an iterative maximum-likelihood (ML) approach. It is worth mentioning that some fast adaptation approaches derive the bases from the given adaptation data rather than using predefined ones (Gales, 1999; Kenny et al., 2005).
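
As an illustration of the underlying data structure, the N SA models can be flattened into mean supervectors and stacked as the columns of an exemplar dictionary. A minimal numpy sketch follows; the array shapes and helper name are ours, for illustration only:

    import numpy as np

    def build_exemplar_dictionary(sa_means):
        """Stack the N speaker-adapted (SA) models' Gaussian means into a
        dictionary: one mean supervector per column.

        sa_means: list of N arrays of shape (num_gaussians, feat_dim)."""
        return np.stack([m.reshape(-1) for m in sa_means], axis=1)

    # Toy shapes: 92 training speakers, 8 Gaussians of dimension 39 each.
    rng = np.random.default_rng(0)
    sa_means = [rng.normal(size=(8, 39)) for _ in range(92)]
    D = build_exemplar_dictionary(sa_means)  # D.shape == (312, 92)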

The aforementioned fast adaptation techniques differ mainly in the way the K bases are created or selected. The work presented in Mak et al. (2006) employs a Viterbi-alignment-based ML search for basis selection. This ML search becomes very cumbersome when N is large and is hence unsuitable for adapting interactive ASR tasks. The approach reported in Teng et al. (2009) requires estimating the interpolation weights for all N models using a single iteration of ML estimation; the K acoustic models with the largest-magnitude interpolation weights are selected for deriving the adapted model, and their weights are then re-estimated iteratively. The complexity of estimating weights for all N models limits the feasibility of this approach for ASR tasks involving human-machine interaction.
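
A sketch of this pre-selection step, with an ordinary least-squares fit standing in for the single ML iteration of Teng et al. (2009) (the substitution is ours; the original estimates the weights in the ML sense):

    import numpy as np

    def preselect_by_weight_magnitude(D, target, K):
        """Estimate weights for all N bases in one pass, then keep the K
        bases with the largest-magnitude weights.

        D: (dim, N) dictionary of mean supervectors; target: (dim,)."""
        w, *_ = np.linalg.lstsq(D, target, rcond=None)  # stand-in for one ML pass
        return np.argsort(np.abs(w))[::-1][:K]          # indices of selected bases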

The work presented in this paper is inspired by recent works employing a dynamic selection of acoustically close bases. In our earlier work (Shahnawazuddin and Sinha, 2014a), techniques employing sparse representation (SR) (Elad, 2010) over a redundant dictionary were explored for the dynamic selection of the bases. In that work, sparse coding was used only for the basis selection, while the basis coefficients of the sparse code were discarded; like the other model-interpolation-based techniques, the interpolation weights were estimated iteratively in the ML sense (a minimal sketch of this selection-only scheme is given after the list below). The greedy SR-based approaches reduced the latency of the basis selection, but the computational cost of the weight estimation remained the same. In contrast to our earlier work, in this paper we propose a novel use of sparse coding to derive the adapted model mean parameters. The main contributions of this work are as follows:

  • Exploration of a sparse-coding-based basis selection approach that reduces the computational cost of learning the interpolation weights for the selected bases.

  • Enhancement of the proposed approach with sparse coding over multiple redundant dictionaries, which relaxes the degrees of freedom in the weight estimation without much increase in computational cost.

  • Exploration of the proposed approach in the context of highly mismatched acoustic conditions as well as in combination with other existing fast adaptation approaches.
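
The selection-only use of sparse coding mentioned above can be sketched with scikit-learn's OMP implementation; only the support of the sparse code is retained, and the coefficients are discarded:

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def select_bases_by_omp(D, target, K):
        """Greedy OMP over the exemplar dictionary D (dim, N); return only
        the indices of the K selected bases (coefficients are discarded)."""
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
        omp.fit(D, target)
        return np.flatnonzero(omp.coef_)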

The remainder of this paper is organized as follows: Section 2 gives a brief review of the SR-based basis selection technique. Section 3 describes the proposed scheme that employs sparse coding to derive the Gaussian mean parameters. Section 4 presents the multiple-dictionary-based approach to overcome the loss of information. Section 5 discusses the experimental setup and the evaluation of the proposed schemes. Finally, Section 6 concludes the paper.

Section snippets

Review of SR-based basis selection

The ML search involving Viterbi-based alignment is one of the techniques used for selecting the bases (Mak et al., 2006). As already mentioned, N SA models (one per training speaker) are first created by adapting the Gaussian mean vectors of the SI system. The K SA models having the highest likelihood with respect to the given test data are then selected for interpolation. In the ML basis search, the test data is aligned against each of the SA models…
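
For illustration, if each SA model is reduced to a single diagonal-covariance GMM scored frame-by-frame (a simplification of the Viterbi-alignment-based search against full HMMs), the selection amounts to:

    import numpy as np

    def gmm_loglik(X, means, variances, weights):
        """Average per-frame log-likelihood of frames X (T, d) under a
        diagonal-covariance GMM with M components."""
        diff2 = (X[:, None, :] - means[None, :, :]) ** 2        # (T, M, d)
        log_comp = (np.log(weights)[None, :]
                    - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                    - 0.5 * np.sum(diff2 / variances[None, :, :], axis=2))  # (T, M)
        m = log_comp.max(axis=1, keepdims=True)                 # log-sum-exp
        return np.mean(m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1)))

    def select_top_k_models(X, sa_models, K):
        """Return indices of the K SA models scoring highest on test data X."""
        scores = [gmm_loglik(X, m["means"], m["vars"], m["weights"])
                  for m in sa_models]
        return np.argsort(scores)[::-1][:K]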

Using sparse coded target as model parameter

In the approach outlined in Section 2, the model interpolation is done using weights derived by an iterative ML estimation procedure. The work reported in Gales (1999) employed 4 iterations for the weight estimation, but in our earlier work (Shahnawazuddin and Sinha, 2014a) it was noted that 6 to 7 iterations are required to ensure convergence. This accounts for the major portion of the computational cost of the model-interpolation-based adaptation process. To address this, we explored…
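
A minimal sketch of the proposed step: OMP sparse-codes the target supervector, the selected bases are compressed into a single supervector via the sparse coefficients, and one scaling factor is applied. A least-squares scalar stands in here for the paper's ML scaling, which would use the adaptation-data statistics instead:

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def sparse_coded_adapted_means(D, target, K):
        """Sparse-code the target mean supervector over D (dim, N) with OMP,
        compress the K selected bases into one supervector, then scale it."""
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
        omp.fit(D, target)
        compressed = D @ omp.coef_               # single compressed supervector
        # Scalar least-squares scale: a stand-in for the ML scaling.
        alpha = float(compressed @ target) / float(compressed @ compressed)
        return alpha * compressed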

Increasing the degrees of freedom

The scheme discussed in Section 3 substantially reduces the computational cost compared to jointly learning the scaling factors for all K bases. At the same time, the restriction of the degrees of freedom in the weight estimation does lead to some loss of information. It is evident from Fig. 1(c) that the employed global scaling is bound to be suboptimal, as there is a small spread about the mean. Consequently, the obtained recognition performance is noted to be…
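
The multiple-dictionary extension can be sketched as separate OMP codings, one per dictionary, whose compressed supervectors are combined with jointly estimated scales; this yields one degree of freedom per dictionary instead of a single global scale (least squares again stands in for the ML estimation):

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def multi_dictionary_adapted_means(dictionaries, target, K):
        """Separate OMP coding over each dictionary (e.g., one exemplar and
        one learned); the compressed supervectors are combined with jointly
        estimated scales, one degree of freedom per dictionary."""
        compressed = []
        for D in dictionaries:
            omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
            omp.fit(D, target)
            compressed.append(D @ omp.coef_)
        C = np.stack(compressed, axis=1)                     # (dim, n_dicts)
        alphas, *_ = np.linalg.lstsq(C, target, rcond=None)  # joint LS scales
        return C @ alphas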

Experimental setup

To evaluate the performance of the proposed technique, a GMM–HMM-based ASR system is developed using the HTK toolkit (Young et al., 2006). The WSJCAM0 speech corpus (Robinson et al., 1995) is used for learning the GMM–HMM acoustic models. The training set consists of 7861 utterances from 92 (male/female) speakers, with approximately 90 sentences per speaker, amounting to a total of 15.5 h of speech data. In order to simulate a telephone-based query system, all speech data is re-sampled to 8 kHz…
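
The resampling step can be reproduced with scipy's polyphase resampler (a minimal sketch; WSJCAM0 audio is recorded at 16 kHz, and file I/O is omitted):

    from scipy.signal import resample_poly

    def downsample_to_8k(waveform, orig_sr=16000):
        """Resample a 16 kHz waveform to 8 kHz using polyphase filtering."""
        assert orig_sr == 16000, "sketch assumes 16 kHz input"
        return resample_poly(waveform, up=1, down=2)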

Conclusion and future work

A novel fast adaptation approach employing scaled sparse coding over redundant dictionaries has been proposed in this paper. The proposed approach is found to yield recognition performance similar to that of the existing techniques in both the utterance-specific and the incremental modes of adaptation. This is verified experimentally using two different test sets, viz. the adults’ test set (Nov’92) and the children’s test set (PFts). In the case of PFts, the…

Acknowledgment

The authors wish to thank Dr. Kai Yu, Research Professor in the Computer Science and Engineering Department, Shanghai Jiao Tong University, for sharing the HTK patch code for CAT.

References

  • M.A.T. Figueiredo et al.

    Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems

    IEEE J. Select. Topics Signal Process.

    (2007)
  • M.J.F. Gales

    The Generation And Use Of Regression Class Trees For MLLR Adaptation

    Technical Report

    (1996)
  • M.J.F. Gales

    Cluster adaptive training of hidden Markov models

    IEEE Trans. Speech Audio Process.

    (1999)
  • J.L. Gauvain et al.

    Maximum a-posteriori estimation for multivariate Gaussian mixture observations of Markov chains

    IEEE Trans. Speech Audio Process.

    (1994)
  • S. Ghai

    Addressing Pitch Mismatch for Children’s Automatic Speech Recognition

    (2011)
  • S.S. Gray et al.

    Child automatic speech recognition for US English: Child interaction with living-room-electronic-devices

    Proceedings of the INTERSPEECH Workshop on Child Computer Interaction

    (2014)
  • S. Hahm et al.

    Aspect-model-based reference speaker weighting

    Proceedings of ICASSP

    (2010)
  • T.J. Hazen et al.

    A comparison of novel techniques for instantaneous speaker adaptation

    Proceedings of the European Conference on Speech Communication and Technology

    (1997)
  • G.E. Hinton et al.

    Deep neural networks for acoustic modeling in speech recognition

    IEEE Signal Process. Mag.

    (2012)
  • Y. Jeong

    Speaker adaptation using probabilistic linear discriminant analysis for continuous speech recognition

    IET Lett.

    (2013)
  • Y. Jeong et al.

    New speaker adaptation method using 2-D PCA

    IEEE Signal Process. Lett.

    (2010)
  • I.T. Jolliffe

    Principal Component Analysis

    (1986)