Elsevier

Medical Engineering & Physics

Volume 32, Issue 10, December 2010, Pages 1189-1197

Isolated word recognition of silent speech using magnetic implants and sensors

https://doi.org/10.1016/j.medengphy.2010.08.011

Abstract

There are a number of situations in which individuals wish to communicate verbally but are unable to use conventional means (so-called ‘silent speech’). These include speakers in noisy and covert situations as well as patients who have lost their voice as a result of a laryngectomy or similar procedure. This paper focuses on those who are unable to speak following a laryngectomy and assesses the possibility of speech recognition based on a magnetic implant/sensor system. Permanent magnets are placed on the tongue and lips, and the changes in magnetic field resulting from their movement during speech are monitored using a set of magnetic sensors. The sensor signals are compared to sets of pre-recorded templates using the dynamic time warping (DTW) method, and the best match is identified. Experimental trials are reported for subjects with an intact larynx, typically using 500–1000 utterances for speaker-dependent training and testing. It is shown that recognition rates of over 90% are achievable for vocabularies of at least 57 isolated words: sufficient to drive command-and-control applications.
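As an illustration of the template matching outlined in the abstract, the following is a minimal sketch of DTW-based isolated word recognition. Only the use of DTW and the selection of the best-matching template are taken from the abstract; the Euclidean local cost, the unconstrained warping path, the two-channel synthetic signals and the function names `dtw_distance` and `recognise` are assumptions made for illustration and are not the authors' implementation.

```python
import numpy as np


def dtw_distance(a, b):
    """Accumulated DTW cost between two multichannel sequences.

    a, b: arrays of shape (frames, channels); frame counts may differ.
    Local cost: Euclidean distance between frames (an assumption).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # local[i, j] = distance between frame i of a and frame j of b
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Allowed steps: vertical, horizontal and diagonal
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]


def recognise(utterance, templates):
    """Return (word, cost) of the template with the lowest DTW cost.

    templates: dict mapping each word to a list of template arrays.
    """
    best_word, best_cost = None, np.inf
    for word, examples in templates.items():
        for template in examples:
            cost = dtw_distance(utterance, template)
            if cost < best_cost:
                best_word, best_cost = word, cost
    return best_word, best_cost


# Synthetic two-channel "sensor" signals (hypothetical data, not from the paper)
t = np.linspace(0.0, 1.0, 40)
templates = {
    "one": [np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])],
    "two": [np.column_stack([t, 1.0 - t])],
}

# A slower rendition of "one": same trajectory, 60 frames instead of 40
t_slow = np.linspace(0.0, 1.0, 60)
utterance = np.column_stack(
    [np.sin(2 * np.pi * t_slow), np.cos(2 * np.pi * t_slow)]
)

word, cost = recognise(utterance, templates)
print(word)
```

Because DTW aligns the two sequences non-linearly in time, the slower utterance still matches its own template more closely than the other word. A practical system would probably also normalise the signals and constrain the warping path, neither of which is shown here.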

Introduction

Silent speech interfaces have attracted growing research interest over the past few years with a range of techniques and applications considered. The potential applications include those where the speaker is unable to generate normal speech due to disease or trauma; those where speech is inaudible, for instance due to ambient noise, and those where audible speech is undesirable such as in quiet environments or covert situations. While similar silent interfaces could be applied to all of these situations, the acceptable impact on the user differs with each scenario. Thus a person who is physiologically incapable of speech is likely to be more willing to accept an implanted device which restores their speech than a person with normal speech who wishes to communicate in a noisy environment. The work described in this paper is aimed at those who are unable to speak, in the first instance due to laryngectomy. In this situation a long-term solution is desirable and users are likely to be willing to undergo minor surgical procedures and undertake training in order to use an interface which restores speech communication.

A laryngectomy is the surgical removal of the larynx, which may be required for the treatment of laryngeal carcinoma or other destructive diseases of the larynx, or following extensive trauma to the throat. The removal of the larynx inevitably results in the loss of the patient's voice and, although a number of methods are available to restore speech, they all have limitations. Sound can be created by swallowing air and belching, and this sound can then be formed into words. This ‘oesophageal speech’ is difficult to learn, and fluent speech is impossible. Vibrating the soft tissues of the throat with an electrolarynx creates sound which can be articulated into speech, but the voice is monotonic, sounds electronic, and can be difficult to understand. In addition, a hand-held device is typically required, which some users find inconvenient and socially unacceptable since it draws attention to their laryngectomised condition. The current preferred method is to use a small silicone tracheo-oesophageal fistula speech valve that connects the trachea and the oesophagus [1]. Air from the lungs is diverted through the fistula into the throat, which vibrates, and the resulting sound is formed into speech. These silicone valves work very well initially but rapidly become coated in a biofilm, causing them to fail after an average of only 3–4 months [2], [3], [4], [5]. Various modifications have been tried over the years to discourage biofilm growth (e.g. [6], [7], [8]), but to date none of these approaches appears to provide a long-term solution to the biofilm problem. The use of ceramic materials, which are resistant to biofilm growth, appears promising [9] but has not yet been tested clinically. Furthermore, the use of speech valves is not possible in all cases, particularly if there is poor healing after radiotherapy, which may require surgical procedures to close the tracheo-oesophageal fistula.

The recognition of speech in the absence of audio cues is by no means new, with lip reading being the longest established and most widely used method by which humans can understand the silent speech of others. Automatic visual lip reading has also been considered for machine interpretation of speech, either alone [10] or to augment audio-based recognition [11]. While these systems perform acceptably in laboratory conditions, their performance can be sensitive to variations in lighting conditions and they are unlikely to be acceptable for use in public, owing to the need for a camera to image the mouth. A range of other techniques have been considered to extract signals which may be used to interpret speech. Contact between the tongue and the palate is important in many speech elements, and recognition of speech using this contact information has been investigated in [12]. An electropalatograph (palatometer) consisting of 118 contacts was used, and trials involving 50 words were conducted using a range of recognition techniques, resulting in recognition rates of up to 78%. Imaging of the vocal tract, using either ultrasound [13] or radar [14], either alone or in conjunction with optical imaging, provides reasonable performance, but the sensors involved are not well suited to public use. Non-audible murmur (NAM) microphones, which are capable of detecting whispered speech, may be appropriate as a silent speech interface for users with an intact larynx [15], but without a larynx there is no sound available to amplify. In this case, an additional vibration source is required in the form of an electrolarynx. The combination of electrolarynx and NAM has been shown to improve the naturalness of the speech but to reduce intelligibility when compared to the use of an electrolarynx alone [16]. It should also be noted that effective use of these devices requires extensive training. Similarly, measurement of glottal activity using vibration and electromagnetic sensing [17] has been investigated for speech enhancement and sub-audible speech detection but, once again, this appears to be more suited to speakers with an intact larynx. Measurement of muscle stimulus signals using electromyography (EMG) techniques has been investigated for speech recognition (with an intact larynx) using small vocabularies and, generally, isolated words [18]. While recognition rates of 70–90% are reported under these circumstances, there are challenges in extending these results to larger vocabularies and continuous speech. In particular, it is noted that the EMG signal depends on the muscle size. This means that it may prove difficult to differentiate between the stimuli to some of the small muscles involved in speech and those to the larger, non-speech-related muscles (e.g. the strap muscle), particularly if the user moves during speech. Also, the needle electrodes required for EMG signal detection remain visible, are unsightly and are prone to infection.

A further abstraction from acoustic recognition is to look for signals in the brain corresponding to speech or even thinking about speech. A number of approaches based on non-invasive EEG techniques or implanted electrodes are described in Ref. [19]. These techniques clearly have no requirement for functioning vocal apparatus, but so far results are only preliminary. A review and comparison of key silent speech interface technologies (including preliminary work on the system described here [20]) is presented in [21] where each method is evaluated against criteria such as cost, invasiveness, suitability for laryngectomees and market readiness.

The system described in the current paper aims to recognise speech based on the measured movement of the articulators, using magnetic implants and sensors. This approach has been selected because it appears to be less obtrusive than some of the sensors described above and less invasive than others, is of relatively low cost, and provides a reasonable recognition rate while being suitable for laryngectomees. The proposed approach, the signal processing and the proposed recognition algorithm are described in more detail in Section 2. The experimental trials and the key results are described in Section 3 and discussed, along with directions for future development, in Section 4.

Section snippets

System description

The concept underlying the proposed system is that multiple permanent magnets implanted into the articulators generate a varying three dimensional magnetic field during speech, and that by monitoring the vector magnetic field at a number of points around the face and head, it may be possible to identify patterns corresponding to particular elements of speech and thus recognise the intended speech in the absence of audio information. It should be noted that in our approach the aim is not to

Experimental trials

Two sets of experimental trials have been conducted based on two vocabularies: Vocabulary 1, consisting of the digits “zero” to “nine”, and Vocabulary 2, designed to give a more diverse phonetic range, consisting of the 10 digits and a further 47 words selected to include all phones in the ARPAbet [34] (see Table 1), giving 57 words in total.

Recordings of Vocabulary 1 consisted of sets of 10 repetitions of each word in random order, while recordings of Vocabulary 2 consisted of sets of either 3 or 5 repetitions of the vocabulary. Several

Discussion and future work

The results described above indicate that magnetic implants and sensors are capable of generating sufficient information to allow accurate isolated word recognition for small vocabularies. Recognition rates achieved are in excess of 90% for a 57-word vocabulary which contains several similar words. This performance compares favourably with other available technologies (for example, 91% using optical image processing [10], 78% using an electropalatograph [12] and 70–90% using EMG [18] for

Conflict of interest

The authors are not aware of any real or potential conflict of interest.

Acknowledgement

This project has been funded by a generous grant from The Henry Smith Charity and Action Medical Research.

References (37)

  • S.R. Ell. A retrieval study to investigate the failure of silastic speaking valves used post-laryngectomy. M.D. Thesis....
  • S.E.J. Eerenstein et al. First results of the VoiceMaster prosthesis in three centres in the Netherlands. Clin Otolaryngol (2001)
  • E.P.J.M. Everaert et al. A new method for in vivo evaluation on surface-modified silicone rubber voice prostheses. Eur Arch Otorhinolaryngol (1997)
  • F.J.M. Hilgers et al. A new problem solving indwelling voice prosthesis, eliminating frequent candida- and ‘under-pressure’-related replacements: Provox ActiValve. Acta Otolaryngol (2003)
  • M.J. Fagan et al. Development of a second generation tracheo-oesophageal fistula speech valve
  • T. Hasegawa et al. Oral image to voice converter, image input microphone
  • M.J. Russell et al. Feature selection for the development of a new artificial larynx. World Cong Med Phys Biomed Eng (2009)
  • B. Denby et al. Prospects for a silent speech interface using ultrasound imaging (2006)