Automatic speech recognition for under-resourced languages: A survey
Introduction
Nowadays, computers are heavily used to communicate via text and speech. Text processing tools, electronic dictionaries, and advanced speech processing systems such as text-to-speech (speech synthesis) and speech-to-text (speech recognition) are readily available for several languages. There are, however, more than 6900 languages in the world, and only a small fraction of them offers the resources required for implementing Human Language Technologies (HLT). Thus, HLT are mostly concerned with languages for which large resources are available or which have suddenly become of interest because of the economic or political scene. Unfortunately, most languages from developing countries or minority communities have received little attention so far. One way of bridging this “language divide” is to do more research on the portability of speech and language technologies for multilingual applications, especially for under-resourced languages.
This paper is a review of automatic speech recognition (ASR) for under-resourced (UR) languages, a topic that has attracted growing interest in recent years. While the task of ASR is rather specific, some issues addressed in this paper apply to other HLT tasks as well. This paper is organized as follows: after an Introduction that focuses on language diversity and on our motivation for addressing the topic, Section 2 gives a brief definition of what we call “under-resourced languages”, as well as the challenges associated with them. Section 3 is a literature review of recent contributions to ASR for under-resourced languages. Examples of past projects on this topic are given in Section 4, while Section 5 presents future trends in dealing with under-resourced languages. Finally, Section 6 concludes this work.
Counting the number of languages in the world is not a straightforward task. First, one has to define what makes a language: for example, to decide whether dialects count as languages and, if so, which ones should be added, or, if not, where to draw the line between a language and a dialect. An estimate of the total number of living languages in the world can be found on the Ethnologue1 web site. They define a living language as “one that has at least one speaker for whom it is their first language”. Extinct languages and languages that are spoken only as a second language are therefore excluded from these counts. Based on this definition, Ethnologue lists 6909 known living languages. This list includes 473 languages that are classified as nearly extinct, i.e. ones for which “only a few elderly speakers are still living”. It is important to note that Ethnologue’s list includes both verbal and visual-kinetic spoken languages. The latter are known as sign languages, which are used for everyday communication by the deaf; they combine hand gestures with lip articulation and facial mimics. Almost every country in the world defines its own national sign language.
Counting how many languages have a written form is also subject to controversy. The Foundation for Endangered Languages web site2 mentions 2000 written languages, counted from published Bibles (complete or in part), but this figure also includes non-living languages. Omniglot,3 an online encyclopedia of writing systems and languages, lists fewer than 1000 written languages and gives details on more than 180 different writing systems.
While counting languages is a tricky task, the number of “well-resourced languages” can easily be estimated by listing how many languages are supported by core technologies and resources, such as Google Translate (63 languages involved4 in 2012), Google search (more than one hundred languages in 2012), the Siri ASR application (8 languages in 2012), Wiktionary5 (∼80 languages in 2012), and Google Voice Search (29 languages and accents in 2012).
In today’s globalized world, languages are disappearing at an alarming rate. Crystal (2000) estimated that over the next century about half of all existing languages will be extinct. On average, one could say that every two weeks one language dies. A survey by the Summer Institute of Linguistics (SIL) from February 1999 revealed that about 51 languages are left with only one speaker, 500 languages have 500 speakers left, and 3000 languages have fewer than 10,000 speakers left. The graph below summarizes the distribution of speakers over languages from the SIL survey. It shows that 96% of the world’s languages are spoken by only 4% of its people.
History has shown that not even a language with 100,000 remaining speakers is safe from extinction (Crystal, 2000). The survival of a language depends on the pressure imposed on that language and on its speakers. Pressure may arise from disasters (earthquakes in Papua New Guinea wiped out several languages), genocide (about 90% of America’s natives died within 200 years of the European conquest) or simply from the dominance of another language. The latter may result in cultural assimilation (social, political or economic benefits of speaking the dominant language) that usually leads to the loss of the suppressed language within a few generations (e.g. second-generation immigrants).
How could language extinction be slowed down, and what are the associated costs? First of all, a language can only be saved if the community itself wants it and the surrounding culture respects this wish. Typically, the community is then supported to fund courses, materials, and teachers. In addition, linguists go into the field, collect and publish language-related information such as grammars, dictionaries, and speech recordings, and make them available to the public at large. The associated costs depend on the particular conditions, for example whether the language has a writing system. Crystal estimates about USD 80,000 per year per language. Considering 3000 endangered languages, this would add up to more than USD 700 Million. Organizations like the Foundation for Endangered Languages (FEL) and large-scale UNESCO projects have been established to raise both attention and funds to tackle this major challenge (see Fig. 1).
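As a quick sanity check on the figures quoted above (a back-of-envelope sketch only; the assumption that the USD 700 Million total aggregates roughly three years of yearly funding is ours, not stated in Crystal's estimate):

```python
# Back-of-envelope check of the cost estimates quoted above.
# Assumption (ours): the "more than USD 700 Million" total corresponds
# to roughly three years of per-language support.
cost_per_language_per_year = 80_000      # USD, Crystal's per-language estimate
endangered_languages = 3_000

yearly_total = cost_per_language_per_year * endangered_languages
three_year_total = 3 * yearly_total

print(f"per year:    USD {yearly_total:,}")      # USD 240,000,000
print(f"three years: USD {three_year_total:,}")  # USD 720,000,000
```

Under that reading, three years of support for 3000 languages indeed exceeds USD 700 Million.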
Some languages might be more attractive than others for Human Language Technologies (HLT). Nevertheless, there are good reasons for developing speech recognition (and other technologies, such as machine translation) for literally all languages in the world. First of all, spoken language is the primary means of human communication. Both individual and community memories, ideas, major events, practices, and lessons learned are preserved and transmitted through language. Furthermore, language is not only a communication tool but is fundamental to cultural identity and empowerment. Language diversity in the world is thus the basis of our rich cultural heritage. If the world loses a language, the memories and experiences of its culture go with it. Crystal claims that language diversity should be treated like biodiversity, as history has shown that the most diverse ecosystems are the strongest.
Human Language Technologies have a lot to offer to revitalize and (at least) document languages and thus prevent or slow down language extinction. The existence of technology may raise interest and make a language attractive again to its native speakers. Moreover, with a view to saving some endangered languages (some of which are mostly spoken, not written), the possibility of rapidly developing ASR systems to transcribe them is an important step toward their preservation and would facilitate access to audio content in these languages. A second reason why HLT should be available for all languages is that the political impact of a language can be very volatile. In today’s world, language is one of the few remaining barriers that hinder human-to-human interaction. Events such as armed conflicts or natural disasters might make it important to communicate with speakers of a less-prevalent language, e.g. for humanitarian workers in a disaster area (see, for instance, the earthquake in Haiti that highlighted the need for technologies to handle the Haitian Creole language6). Often, the people one needs to communicate with in such a scenario speak only their own language, which is unknown to the outsider, e.g. a foreign doctor trying to help. In these cases, human translators are often not available in the necessary numbers or in a timely manner. Here, readily available technology such as speech translation systems can be highly beneficial. Such technology might be far from perfect, but faced with the alternative of having no translation system at all for an unknown language in an emergency situation, even an imperfect system will be of great use. Therefore, HLT should be developed especially for under-resourced languages.
Last but not least, some under-resourced languages may blossom in the future into languages of strong social, political, or economic power (see, for instance, languages of rapidly developing countries, such as Bengali, Malay, Vietnamese, and Urdu, or vehicular languages of Africa, such as Swahili and Wolof, some of which are already among the top 20 most spoken languages in the world).
Definition
The term “under-resourced languages”, introduced by Krauwer (2003) and Berment (2004), refers to a language with some (if not all) of the following aspects: lack of a unique writing system or stable orthography, limited presence on the web, lack of linguistic expertise, lack of electronic resources for speech and language processing, such as monolingual corpora, bilingual electronic dictionaries, transcribed speech data, pronunciation dictionaries, vocabulary lists, etc. The synonyms for the same
Components of ASR systems
Automatic speech recognition (ASR) converts a speech signal into a textual representation, i.e. the sequence of uttered words, by means of an algorithm implemented as a software or hardware module. Several types of natural speech and corresponding ASR systems are identified: spelled speech (with pauses between letters or phonemes), isolated speech (with pauses between words), continuous speech (when a speaker does not make any pauses between words), spontaneous speech (e.g. in a human-to-human dialog),
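The conversion described above is conventionally cast as a statistical search: the decoder looks for the word sequence W maximizing P(X|W)·P(W), where P(X|W) is the acoustic model score for the audio X and P(W) is the language model score. A minimal sketch of this noisy-channel combination, with made-up log-probability scores (the hypotheses and numbers below are purely illustrative, not from any real system):

```python
# Toy sketch of the standard ASR decision rule:
#   W* = argmax_W  log P(X|W) + lm_weight * log P(W)
# Scores are invented log-probabilities for two competing hypotheses
# for the same (hypothetical) audio input X.
candidates = {
    "recognize speech":   {"acoustic": -12.0, "lm": -4.0},
    "wreck a nice beach": {"acoustic": -11.5, "lm": -9.0},
}

def decode(candidates, lm_weight=1.0):
    """Return the hypothesis maximizing the combined log score."""
    return max(candidates,
               key=lambda w: candidates[w]["acoustic"]
                             + lm_weight * candidates[w]["lm"])

print(decode(candidates))                 # "recognize speech": the language
                                          # model outweighs the slightly worse
                                          # acoustic score
print(decode(candidates, lm_weight=0.0))  # "wreck a nice beach": acoustic only
```

In practice the hard part is the search over an enormous hypothesis space (e.g. beam search over HMM states); for under-resourced languages, both P(X|W) and P(W) must additionally be estimated from scarce data, which is precisely the challenge this survey addresses.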
Voice search in three South African languages
South Africa is a highly diverse country, with wide social disparities and eleven official languages. Technology projects that address social issues while also bridging language barriers have therefore attracted substantial attention in South Africa in recent years (Barnard et al., 2010), and substantial progress has been made in developing speech resources and systems that encompass all eleven languages. A highly visible (and commercially relevant) result of this activity was the development
Endangered languages
As already mentioned, language diversity is fragile, as some languages are threatened or in real danger of extinction. With such a perspective, revitalization and documentation programs are emerging.26 So, while there is commercial interest in enabling the ∼300 most widely spoken languages in the digital domain (digital technologies working for this group of languages would cover 95% of humanity), there
Conclusion
Our survey and the papers in this Special Issue demonstrate that speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. The current review has focused on speech recognition, since that area has been the most significant focus of research for these languages; however, it should be clear that many of the issues and approaches apply to speech technology in general. Although much of the recent
References
- et al., 2006. A unified language model for large vocabulary continuous speech recognition of Turkish. Signal Processing.
- et al., 2000. Structured language model. Computer Speech and Language.
- et al., 1995. Multi-lingual spoken language understanding in the MIT voyager system. Speech Communication.
- et al., 2010. Morpho-syntactic postprocessing of N-best lists for improved French automatic speech recognition. Computer Speech and Language.
- et al., 2007. Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Communication.
- Abdillahi, N., Nocera, P., Bonastre, J.-F., 2006. Automatic transcription of Somali language. In: ICSLP’06, Pittsburgh,...
- Ablimit, M., Neubig, G., Mimura, M., Mori, S., Kawahara, T., Hamdulla, A., 2010. Uyghur Morpheme-based language models...
- Adda-Decker, M., 2003. A corpus-based decompounding algorithm for German lexical modeling in LVCSR. In: Proc....
- Arisoy, E., Sainath, T.N., Kingsbury, B., Ramabhadran, B., 2012. Deep neural network language models. In: Proc....
- Barnard, E., Davel, M., van Heerden, C., 2009. ASR corpus design for resource-scarce languages. In: Proc. Interspeech,...
- Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Transactions on Speech and Language Processing.