Speech Communication (Elsevier), Volume 23, Issues 1–2, October 1997, Pages 113–127

How may I help you?

https://doi.org/10.1016/S0167-6393(97)00040-X

Abstract

We are interested in providing automated services via natural spoken dialog systems. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. Many issues arise when such systems are targeted at large populations of non-expert users. In this paper, we focus on the task of automatically routing telephone calls based on a user's fluently spoken response to the open-ended prompt of “How may I help you?”. We first describe a database generated from 10,000 spoken transactions between customers and human agents. We then describe methods for automatically acquiring language models for both recognition and understanding from such data. Experimental results evaluating call classification from speech are reported on that database. These methods have been embedded within a spoken dialog system, with subsequent processing for information retrieval and form-filling.

Introduction

There are a wide variety of interactive voice systems in the world, some residing in laboratories, many actually deployed. Most of these systems, however, either explicitly prompt the user at each stage of the dialog, or assume that the person has already learned the permissible vocabulary and grammar at each point. While such an assumption is conceivable for frequent expert users, it is dubious at best for a general population on tasks of even moderate complexity. In this work, we describe progress towards an experimental system which shifts the burden from human to machine, making it the device's responsibility to respond appropriately to what people actually say.

The problem of automatically understanding fluent speech is difficult, at best. There is, however, the promise of solution within constrained task domains. In particular, we focus on a system whose initial goal is to understand its input sufficiently to route the caller to an appropriate destination in a telecommunications environment. Such a call router need not solve the user's problem, but only transfer the call to someone or something which can. For example, if the input is “Can I reverse the charges on this call?”, then the caller should be connected to an existing automated subsystem which completes collect calls. Another example might be “How do I dial direct to Tokyo?”, whence the call should be connected to a human agent who can provide dialing instructions. Such a call router should be contrasted with traditional telephone switching, wherein a user must know the phone number of their desired destination, or in recent years navigate a menu system to self-select the desired service. In the method described here, the call is instead routed based on the meaning of the user's speech.

This paper proceeds as follows. In Section 2, an experimental spoken dialog system is described for call-routing plus subsequent automatic processing of information retrieval and form-filling functions. The dialog is based upon a feedback control model, where at each stage the user can provide both information plus feedback as to the appropriateness of the machine's response (Gorin, 1995a). In Section 3, a database is described of 10 K fluently spoken transactions between customers and human agents for this task. In particular, we describe the language variability in the first customer utterance, responding to the prompt of “How may I help you?” in a telecommunications environment.

In Section 4, we describe the spoken language understanding (SLU) algorithms which we exploit for call classification. A central notion in this work is that it is not necessary to recognize and understand every nuance of the speech, but only those fragments which are salient for the task (Gorin, 1995a). This leads to a methodology where understanding is based upon recognition of such salient fragments and combinations thereof.

There are three main components in our SLU methodology. First is to automatically acquire salient grammar fragments from the data, modeling those parts of the language which are meaningful for the task plus their statistical associations to the machine actions. Second is to recognize these fragments in fluent speech, searching the output of a large vocabulary speech recognizer. The statistical language model which constrains this recognizer embeds automatically-acquired fragments in a stochastic finite state machine, providing an efficient approximation to an n-gram model with variable length units (Riccardi et al., 1996). Third, we exploit these multiple recognized fragments to classify the call-type of an utterance. Since the SLU is embedded within a dialog system, the classifier provides both the best (rank 1) and secondary (rank 2, etc.) decisions. Finally, in Section 5, we report on experimental results for call-classification from the above-mentioned speech database, training on 8 K utterances and testing on 1 K.
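The third component above, ranked call-type classification from recognized fragments, can be illustrated with a minimal sketch. The fragment inventory, association weights, and call-type names below are invented for illustration and are not taken from the paper; the idea shown is only that fragment-to-call-type evidence is accumulated and the resulting call types are ranked so a dialog manager can fall back from the rank-1 to the rank-2 decision.

```python
from collections import defaultdict

def rank_call_types(recognized_fragments, associations):
    """Score each call type by summing the association weights of the
    salient fragments recognized in the utterance, then sort so that
    both the rank-1 and rank-2 decisions are available to the dialog."""
    scores = defaultdict(float)
    for frag in recognized_fragments:
        for call_type, weight in associations.get(frag, {}).items():
            scores[call_type] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented example data: fragment -> {call type: association strength}
associations = {
    "reverse the charges": {"COLLECT": 0.9},
    "dial direct": {"DIAL_FOR_ME": 0.4, "RATE": 0.2},
    "to Tokyo": {"RATE": 0.3, "DIAL_FOR_ME": 0.3},
}
ranked = rank_call_types(["reverse the charges"], associations)
```

In a dialog setting, a low rank-1 score or a small margin over rank 2 would trigger a clarification turn rather than an immediate route.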

A spoken dialog system

The goal of a call-router is to recognize and understand the user's speech sufficiently to determine the call-type. Dialog is necessary since, in many situations, the call type cannot be determined from a single input. This can be due to an ambiguous request or to imperfect performance of the spoken language understanding (SLU) algorithms.

One important component of dialog is confirmation, wherein the machine proposes its understanding of the user's input, receiving reinforcement feedback as to …

Database

In order to enable experimental evaluation, we generated a database of 10 K spoken transactions between customers and human agents. First, both channels of the dialog were recorded from the agents' headset jacks onto a digital audio tape (DAT). At the end of each transaction, a control key was manually depressed (by the human agent) to generate a DTMF code, serving both as a segmentation marker and a call-type label. These recordings were then automatically segmented, filtered and downsampled to …
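The dual role of the DTMF code, segment boundary plus label, can be sketched as follows. The particular codes and call-type names below are hypothetical; the excerpt does not give the actual coding scheme used in the study.

```python
# Hypothetical mapping from agent-entered DTMF codes to call-type labels;
# the real codes used in the study are not given in this excerpt.
DTMF_TO_CALL_TYPE = {"1": "COLLECT", "2": "THIRD_NUMBER", "3": "RATE"}

def label_segments(dtmf_events):
    """Each DTMF event marks the end of one transaction and carries its
    call-type label, so segment i receives the label of the i-th event."""
    return [DTMF_TO_CALL_TYPE.get(code, "OTHER") for code in dtmf_events]
```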

Algorithms

In this section, we describe the algorithms underlying this system and experiments. A key notion is that for any particular task, it is not necessary to recognize and understand every word and nuance in an utterance. That is, to extract semantic information from spoken language, it suffices to focus on the salient fragments and combinations thereof. There are three major issues that we address:

  • How do we acquire the salient grammar fragments for this task?

  • How can we recognize these fragments in …
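On the first question, one simple way to measure how strongly a candidate fragment is associated with the machine actions is to estimate the empirical posterior P(call-type | fragment) from the labeled transcriptions; a fragment is a salience candidate when this posterior is sharply peaked away from the prior over call types. The sketch below is a simplified illustration of that idea, not the paper's exact salience measure, and the toy data are invented.

```python
from collections import Counter

def fragment_associations(labeled_transcriptions, fragment):
    """Estimate P(call_type | fragment occurs) from (text, label) pairs.
    A sharply peaked posterior suggests the fragment is salient."""
    with_frag = [label for text, label in labeled_transcriptions
                 if fragment in text]
    counts = Counter(with_frag)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Toy labeled data (invented for illustration)
data = [
    ("can i reverse the charges on this call", "COLLECT"),
    ("i want to reverse the charges please", "COLLECT"),
    ("how do i dial direct to Tokyo", "DIAL_FOR_ME"),
]
posterior = fragment_associations(data, "reverse the charges")
```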

Experimental results

The database of Section 3 was divided into 8 K training and 1 K test utterances. The remainder of the 10 K database has been reserved for future validation experiments. Salient phrase fragments were automatically generated from the training transcriptions and associated call-types via the methods of Section 4.1. In particular, fragments were restricted to a length of four words or less and a training-set frequency of five or greater. An initial filtering was imposed so that the peak of the a posteriori …
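The stated restrictions (length at most four words, training-set frequency at least five, and a filter on the peak of the fragment's a posteriori call-type distribution) can be sketched as below. The peak threshold value of 0.5 is an assumed illustration, not the paper's setting.

```python
from collections import Counter

def candidate_fragments(token_lists, max_len=4, min_freq=5):
    """Enumerate all word n-grams up to max_len words and keep those
    occurring at least min_freq times in the training transcriptions."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {frag for frag, c in counts.items() if c >= min_freq}

def salient(fragment_posterior, peak_threshold=0.5):
    """Initial filter: keep a fragment only if the peak of its a
    posteriori call-type distribution exceeds a threshold
    (the 0.5 here is an assumption for illustration)."""
    return bool(fragment_posterior) and \
        max(fragment_posterior.values()) >= peak_threshold
```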

Conclusions

We have described progress towards a natural spoken dialog system for automated services. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. A first stage in this system is call-classification, i.e., routing a caller depending on the meaning of their fluently spoken response to “How may I help you?” We have proposed algorithms for automatically acquiring language models for both recognition and understanding, …

Acknowledgements

The authors wish to thank Larry Rabiner, Jay Wilpon, David Roe, Barry Parker and Jim Scherer for their support and encouragement of this research. We also thank Mike Riley and Andrej Ljolje for many hours of useful discussion on the ASR aspects of this effort. We finally thank our colleagues Alicia Abella, Tirso Alonso, Egbert Ammicht and Susan Boyce for their continued collaboration in the creation of a spoken dialog system exploiting the methods of this paper.

References

  • Ljolje, A., 1994. High accuracy phone recognition using context clustering and quasi-triphonic models. Comp. Speech Lang.
  • Matsumura, T., et al., 1995. Non-uniform unit based HMMs for continuous speech recognition. Speech Communication.
  • Riccardi, G., et al., 1996. Stochastic automata for language modeling. Comp. Speech Lang.
  • Abella, A., Brown, M., Buntschuh, B., 1996. Developing principles for dialog-based interfaces. In: Proc. ECAI Spoken...
  • Blachman, N.M., 1968. The amount of information that y gives about x. IEEE Trans. Inform. Theory.
  • Boyce, S., Gorin, A.L., 1996. User interface issues for natural spoken dialog systems. In: Proc. Internat. Symp. on...
  • Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley, New...
  • Garner, P.N., Hemsworth, A., 1997. A keyword selection strategy for dialog move recognition and multi-class topic...
  • Gertner, A.N., Gorin, A.L., 1993. Adaptive language acquisition for an airline information subsystem. In: Mammone, R....
  • Giachin, E., 1995. Phrase bigrams for continuous speech recognition. In: Proc. Internat. Conf. Acoust. Speech Signal...
  • Gorin, A.L., 1995a. On automated language acquisition. J. Acoust. Soc. Amer.
  • Gorin, A.L., 1995b. Spoken dialog as a feedback control system. In: Proc. ESCA Workshop on Spoken Dialog Systems,...
  • Gorin, A.L., 1996. Processing of semantic information in fluently spoken language. In: Proc. Internat. Conf. on Spoken...
  • Gorin, A.L., et al., 1994. An experiment in spoken language acquisition. IEEE Trans. Speech and Audio.
1. E-mail: [email protected].
2. E-mail: [email protected]. On leave of absence from the University of Bristol, UK.
