
Speech Communication

Volume 56, January 2014, Pages 167-180

Acoustic modelling for speech recognition in Indian languages in an agricultural commodities task domain

https://doi.org/10.1016/j.specom.2013.07.005

Highlights

  • “Real-world” data is used for speech recognition systems in 4 Indian languages.

  • The subspace Gaussian mixture model is effective when training data is insufficient.

  • Cross-corpus acoustic mismatch is a serious issue for multi-lingual systems.

  • Apparent cross-lingual phonetic similarities for Hindi and Marathi are “discovered”.

Abstract

In developing speech recognition based services for any task domain, it is necessary to account for the support of an increasing number of languages over the life of the service. This paper considers a small vocabulary speech recognition task in multiple Indian languages. To configure a multi-lingual system in this task domain, an experimental study is presented using data from two linguistically similar languages – Hindi and Marathi. We do so by training a subspace Gaussian mixture model (SGMM) (Povey et al., 2011, Rose et al., 2011) under a multi-lingual scenario (Burget et al., 2010, Mohan et al., 2012a). For this experimental study, speech data was collected from the targeted user population as part of developing spoken dialogue systems in an agricultural commodities task domain. It is well known that acoustic, channel and environmental mismatch between data sets from multiple languages is an issue when building multi-lingual systems of this nature. As a result, we use a cross-corpus acoustic normalization procedure which is a variant of speaker adaptive training (SAT) (Mohan et al., 2012a). The resulting multi-lingual system provides the best speech recognition performance for both languages. Further, the effect of sharing “similar” context-dependent states from the Marathi language on the Hindi speech recognition performance is presented.

Introduction

With the proliferation and penetration of cellular telephone networks in remote regions, a larger proportion of the planet’s population has inexpensive access to meet its communication needs. This has resulted in a wider audience, especially under-served populations, that have access to telephony based information services. Spoken dialog (SD) systems provide a natural method for information access, especially for users who have little or no formal education. Among the challenges facing the development of such systems is the need to configure them in languages that are under-resourced.

Agriculture provides a means of livelihood for over 50% of India’s population. Most of India’s farming is small scale; 78% of farms are five acres or less (Patel et al., 2010, Singh et al., 1999). Indian farmers, a proportion of whom are illiterate, face a host of challenges such as water shortages, increasing cost of farm supplies and inequitable distribution systems that govern the sale of their produce. Access to information through information and communication technologies (ICT) is often seen as a solution to enable and empower rural Indian farmers. There have been many noted efforts in this direction to develop ICTs by members of the private sector, non-governmental organizations (NGOs) and the government. E-Choupal (Bowonder et al., 2003), Avaaj Otalo (Patel et al., 2010), and the Mandi Information System (Mantena et al., 2011, Shrishrimal et al., 2012) are examples of such efforts that have been undertaken in the past or are currently ongoing. In addition, it is noteworthy to mention the evaluation done by Plauché and Nallasamy (2007) to assess the factors involved in setting up a low-cost, small-scale spoken dialogue (SD) system to disseminate agricultural information to farmers in rural Tamil Nadu, India.

The work reported on the Mandi Information System is part of a larger effort called “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” initiated by the Government of India (Mantena et al., 2011, Shrishrimal et al., 2012). The project involves the development of spoken dialog systems to access the prices of agricultural commodities in various Indian districts. This information, updated daily, is also made available online through an Indian government web-portal, www.agmarknet.nic.in. Needless to say, configuring SD systems in any of these languages is hardly a trivial task. A large investment in time, money and effort is also needed to collect, annotate and transcribe the speech data required to develop the automatic speech recognition (ASR) engine that the SD systems depend upon.

Our goal in this paper is not to describe the development of the dialog system or the data collection effort itself, but to describe acoustic modelling configurations that would make the process of developing such systems more efficient. We use a subset of the data provided to us by the teams involved in the project “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” for our experimental study. In terms of the target population for the service, the recording conditions of the speech data, the nature of the task (small-vocabulary) and the service itself, this work is similar to that described by Plauché and Nallasamy (2007). Section 2 provides a description of the data, which was provided to us at a time when the development of the dialog systems was still in progress. We restrict our multi-lingual experimental study to two of the six languages – Hindi and Marathi. With limited available training data in Hindi, and the existing sources of acoustic, background and channel variability associated with the data collected for each language, the data set provided to us from this task domain poses some unique challenges. The speech data is “real-world”, in that it was collected under the conditions that will be experienced in actual use of the service. It is also worth noting that the speech data was collected from a population whose education levels were representative of the target population who would actually use the service.

Our work here is motivated by previous work in developing the subspace Gaussian mixture model (SGMM) (Povey et al., 2011) and by the use of the SGMM for multi-lingual speech recognition reported in Burget et al., 2010, Lu et al., 2011, Lu et al., 2012. Other approaches to training multi-lingual acoustic models involve the use of a so-called common phone set as a means of sharing data between languages (Schultz and Waibel, 2001, Schultz and Kirchhoff, 2006, Vu et al., 2011). Though the use of a common phone set is a popular approach to training multi-lingual acoustic models, it is also quite cumbersome. Since the parametrization of the SGMM naturally allows data to be shared between multiple languages, a common phone set is unnecessary, which makes this approach attractive. The results of previous work in Burget et al. (2010) imply that when there is a limited amount of data in a target language, the SGMM parametrization facilitates the use of “shared” parameters that have been estimated reliably from other well-resourced languages for speech recognition in the target language. Lu et al. (2011) examine the effect that limited training data can have on the estimation of the state-dependent parameters of the SGMM and suggest adding a regularization term for this purpose. Further, Lu et al. (2012) present a maximum a-posteriori (MAP) adaptation method to “adapt” an SGMM acoustic model trained on a related but well-resourced language to a target language in which limited training data is available. The speech data used in all of these studies were collected in noise-free environments and over close-talking microphones. Given our task domain, it seems appropriate to perform an experimental study detailing some of the issues involved in building both mono-lingual and multi-lingual SGMM based systems on speech data intended for a practical application, in languages that are of interest to a large part of the developing world.

In this paper we demonstrate the effects of multi-lingual SGMM acoustic model training carried out after cross-corpus acoustic normalization. A study is performed using the Marathi and Hindi language pair since they are linguistically related languages. Our work shows the importance of compensating for the sources of acoustic variability between speech data collected from multiple languages while training multi-lingual SGMM models. The issue of handling speaker and environmental “factors” that cause variation in the speech signal has been addressed in Gales (2001a). Further, in Seltzer and Acero (2001) the authors propose factoring speaker and environment variability into a pair of cascaded constrained maximum-likelihood linear regression (CMLLR) transforms, one for each source of variability, and propose an adaptive training framework to train the acoustic model. Our approach to cross-corpus acoustic normalization is similar in spirit to that of Seltzer and Acero (2001) in using a pair of “factored transforms” to compensate for speaker and environmental variability. Further, our work addresses the issue of not having enough well-trained context-dependent states in the Hindi language. To address this issue, context-dependent states in the multi-lingual SGMM are borrowed from the more well-resourced Marathi language. This is complementary to the work presented by Lu et al. (2011), where a regularization term is introduced in the optimization criterion for the state-dependent parameters as a means of dealing with limited data in the target language.
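To make the factored-transform idea concrete, the following is a minimal sketch of how a pair of cascaded CMLLR feature-space transforms composes. It only shows the algebra of applying the transforms; the estimation itself (done by EM over adaptation data) is omitted, and all function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def apply_cmllr(x, A, b):
    """One constrained MLLR (CMLLR) feature-space transform: y = A x + b."""
    return A @ x + b

def cascaded_normalize(x, A_env, b_env, A_spk, b_spk):
    """Apply a corpus/environment transform followed by a speaker transform.

    Keeping the two factors separate lets each source of variability be
    estimated (and reused) independently, even though the composition is
    itself a single affine map."""
    return apply_cmllr(apply_cmllr(x, A_env, b_env), A_spk, b_spk)

# Illustrative 2-D example: the cascade equals one affine transform with
# A = A_spk @ A_env and b = A_spk @ b_env + b_spk.
A_env = np.array([[2.0, 0.0], [0.0, 2.0]]); b_env = np.array([1.0, 1.0])
A_spk = np.array([[1.0, 1.0], [0.0, 1.0]]); b_spk = np.array([0.0, 1.0])
x = np.array([1.0, 2.0])
y = cascaded_normalize(x, A_env, b_env, A_spk, b_spk)
```

The order of application (environment first, then speaker, or the reverse) is a design choice in such schemes; the sketch fixes one order purely for illustration.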

To provide the context for this study, we present baseline results for mono-lingual continuous density hidden Markov model (CDHMM) speech recognition performance for Hindi and Marathi in this task domain. We compare the ASR performance of the SGMM based mono-lingual system with respect to the CDHMM baseline (Mohan et al., 2012b). Our experimental study considers the impact of each of the stages of multi-lingual training on the final ASR system performance. Interesting anecdotal results are presented to show that the SGMM’s state-level parameters are able to capture phonetically similar and meaningful information across the two languages. Further, recognition errors made by the final multi-lingual SGMM system on the Hindi test set that are attributed to a lack of adequate context-dependent states are analysed. To this end, an experimental study that demonstrates the impact of borrowing context-dependent states from the Marathi language is presented. The main contributions of this work include the development of the multi-lingual SGMM system for Hindi and Marathi, cross-corpus normalization for multi-lingual training, an analysis of the linguistic similarity between the two languages and the cross-lingual borrowing of contexts from the Marathi (non-target, well-resourced) language.

This paper is organized as follows. Section 2 presents a detailed description of the agricultural commodities task domain. Next, Section 3 briefly describes the SGMM in the context of the experimental study. In Section 4 we describe our experimental setup for both the CDHMM and SGMM systems. The comparative performance of baseline mono-lingual CDHMM systems and the mono-lingual SGMM systems is described in Section 5. In Section 6 we provide a description of training and the performance of the multi-lingual SGMM system for Hindi and Marathi, highlighting the effects of acoustic mismatch between the two languages. After describing an algorithm to obtain features normalized for cross-speaker and cross-corpus acoustic variation, we consider the impact of using these for multi-lingual SGMM training. We summarize the performance of the multi-lingual Hindi and Marathi ASR system obtained with these normalized features. Further, an anecdotal experiment is presented in Section 6.4, where it is shown that errors arising due to poor context modelling in Hindi can be mitigated by borrowing contexts from the Marathi language. Next, an analysis of cross-lingual similarity between the languages, based on the cosine distance measure between the individual state-dependent parameters, is presented in Section 6.5. Finally, in Section 6.6, a method to borrow context-dependent states based on the cosine distance measure is discussed. The effect of appropriately weighting these Marathi language states and their impact on the Hindi language recognition performance is studied. We conclude this paper by summarizing our findings in Section 7.

Section snippets

Agricultural commodities task domain

This section provides a brief description of the data used in this experimental study. As mentioned in Section 1, we use a subset of the data collected for the project titled “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages”, sponsored by the Government of India. The goal of this project is to implement and deploy a speech based system for farmers to access prices of agricultural commodities in various districts across India from inexpensive mobile

The subspace Gaussian mixture model

This section provides a brief description of the subspace Gaussian mixture model (SGMM) implementation (Rose et al., 2011), proposed by Povey et al. (2011). The description here follows the work of Rose et al. (2011).

For an automatic speech recognition (ASR) system configured with J states, the observation density for a given D-dimensional feature vector x for a state j ∈ {1, …, J} can be written as

p(x|j) = Σ_{i=1}^{I} w_{ji} N(x | μ_{ji}, Σ_i),

where I full-covariance Gaussians are shared between the J states. The
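Evaluating this density can be sketched as below. Note that in the full SGMM the state-dependent means and weights are themselves derived from a low-dimensional state vector via globally shared projections; this sketch only evaluates the resulting shared-covariance mixture, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def state_likelihood(x, weights, means, shared_covs):
    """Evaluate p(x|j) = sum_i w_ji * N(x; mu_ji, Sigma_i), where the I
    full covariances Sigma_i are shared across all J states and only the
    weights w_ji and means mu_ji are state-dependent."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for w, mu, cov in zip(weights, means, shared_covs))

# Illustrative example: I = 2 shared full-covariance Gaussians in D = 2.
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
means = [np.zeros(2), np.array([1.0, -1.0])]
weights = [0.6, 0.4]
p = state_likelihood(np.zeros(2), weights, means, covs)
```

Because the covariances are shared, their (expensive) inverses and determinants can be precomputed once per model rather than once per state, which is part of what makes the parametrization compact.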

Experimental setup for the mono-lingual systems

This section gives a summary of the experimental setup for the mono-lingual systems. Section 4.1 describes the setup of the mono-lingual CDHMM system and Section 4.2 describes the setup of the mono-lingual SGMM systems.

Comparative performance of the mono-lingual CDHMM and the mono-lingual SGMM systems

This section summarizes the comparative performance between the baseline mono-lingual CDHMM systems and the mono-lingual SGMM systems. We use the Word Accuracy (WAc.) as the performance measure for recognition. We have also provided the percentage correct scores (% Corr.) for comparison. The percentage correct scores (%Corr) perhaps carry more meaning in the context of this task, since for the spoken dialogue system it is important that the transcription of the words in the utterance be
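Assuming the standard HTK-style definitions, with N reference words and S substitutions, D deletions and I insertions from a minimum-edit-distance alignment, WAc. = 100·(N − S − D − I)/N, while %Corr = 100·(N − S − D)/N ignores insertions. A minimal sketch of computing both (function names are illustrative):

```python
def wer_counts(ref, hyp):
    """Return (S, D, I) from a minimum-edit-distance alignment of word lists."""
    n, m = len(ref), len(hyp)
    # Levenshtein cost table.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace one optimal alignment, counting error types.
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1; i -= 1
        else:
            I += 1; j -= 1
    return S, D, I

def scores(ref, hyp):
    """Return (WAc., %Corr) as percentages of the N reference words."""
    S, D, I = wer_counts(ref, hyp)
    N = len(ref)
    return 100.0 * (N - S - D - I) / N, 100.0 * (N - S - D) / N
```

Under these definitions %Corr can only exceed WAc., which is consistent with reporting it as the more forgiving figure for a dialogue task where spurious insertions matter less than missed or misrecognized words.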

Hindi and Marathi multi-lingual SGMM system

This section summarizes the building of a multi-lingual SGMM system for the closely related language pair – Hindi and Marathi. Both Hindi and Marathi belong to the Indo-Aryan (Cardona, 2003) group of languages. Both languages share a large degree of similarity in terms of syntactic structure, written Devanagari script (Central Hindi Directorate, 1977), and to some extent the vocabulary. The Devanagari script is phonetic in nature. There is more or less a one-to-one correspondence between what

Conclusions

An experimental study was presented that investigated acoustic modelling configurations for speech recognition in the Indian languages – Hindi and Marathi. The experimental study was performed using data from a small vocabulary agricultural commodities task domain that was collected for configuring spoken dialogue systems. Two acoustic modelling techniques for mono-lingual ASR were compared namely – the conventional CDHMM and the SGMM acoustic modelling technique. The SGMM mono-lingual models

Acknowledgements

The authors would like to thank all of the members involved in the data collection effort and the development of the dialogue system for the project “Speech-based Access for Agricultural Commodity Prices in Six Indian Languages” sponsored by the Government of India. We would also like to thank M.S. Research Scholar Raghavendra Bilgi at IIT Madras for his timely help with providing resources for this experimental study.

References (34)

  • D. Povey et al.

    The subspace Gaussian mixture model: a structured model for speech recognition

    Computer Speech & Language

    (2011)
  • T. Schultz et al.

    Language independent and language adaptive acoustic modeling for speech recognition

    Speech Communication

    (2001)
  • Bowonder, B., Gupta, V., Singh, A., 2003. Developing a rural market e-hub, the case study of e-Choupal experience of...
  • L. Burget et al.

    Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models

  • G. Cardona
    (2003)
  • Central Hindi Directorate, I., 1977. Devanagari: development, amplification, and standardisation. Central Hindi...
  • Chopde, A., 2006. ITRANS-Indian language transliteration package....
  • M. Gales

    Maximum likelihood linear transformations for HMM-based speech recognition

    Computer Speech and Language

    (1998)
  • M. Gales

    Acoustic factorisation

  • Gales, M., 2001. Multiple-cluster adaptive training schemes. In: Proceedings of IEEE International Conference on...
  • L. Gillick et al.

    Some statistical issues in the comparison of speech recognition algorithms

  • D. Hakkani-Tur et al.

    Bootstrapping language models for spoken dialog systems from the world wide web

  • Killer, M., Stuker, S., Schultz, T., 2003. Grapheme based speech recognition. In: Eighth European Conference on Speech...
  • K. Lee

    Automatic Speech Recognition: The Development of the SPHINX System

    (1989)
  • L. Lu et al.

    Regularized subspace Gaussian mixture models for cross-lingual speech recognition

  • L. Lu et al.

    Maximum a posteriori adaptation of subspace Gaussian mixture models for cross-lingual speech recognition

  • G. Mantena et al.

    A speech-based conversation system for accessing agriculture commodity prices in Indian languages
