Speech Communication (Elsevier), Volume 23, Issues 1–2, October 1997, Pages 113–127

How may I help you?

https://doi.org/10.1016/S0167-6393(97)00040-X

Abstract

We are interested in providing automated services via natural spoken dialog systems. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. Many issues arise when such systems are targeted at large populations of non-expert users. In this paper, we focus on the task of automatically routing telephone calls based on a user's fluently spoken response to the open-ended prompt of “How may I help you?”. We first describe a database generated from 10,000 spoken transactions between customers and human agents. We then describe methods for automatically acquiring language models for both recognition and understanding from such data. Experimental results evaluating call classification from speech are reported on that database. These methods have been embedded within a spoken dialog system, with subsequent processing for information retrieval and form-filling.

Introduction

There are a wide variety of interactive voice systems in the world, some residing in laboratories, many actually deployed. Most of these systems, however, either explicitly prompt the user at each stage of the dialog, or assume that the person has already learned the permissible vocabulary and grammar at each point. While such an assumption is conceivable for frequent expert users, it is dubious at best for a general population on tasks of even moderate complexity. In this work, we describe progress towards an experimental system which shifts the burden from human to machine, making it the device's responsibility to respond appropriately to what people actually say.

The problem of automatically understanding fluent speech is difficult, at best. There is, however, the promise of solution within constrained task domains. In particular, we focus on a system whose initial goal is to understand its input sufficiently to route the caller to an appropriate destination in a telecommunications environment. Such a call router need not solve the user's problem, but only transfer the call to someone or something which can. For example, if the input is “Can I reverse the charges on this call?”, then the caller should be connected to an existing automated subsystem which completes collect calls. Another example might be “How do I dial direct to Tokyo?”, whence the call should be connected to a human agent who can provide dialing instructions. Such a call router should be contrasted with traditional telephone switching, wherein a user must know the phone number of their desired destination, or in recent years navigate a menu system to self-select the desired service. In the method described here, the call is instead routed based on the meaning of the user's speech.

This paper proceeds as follows. In Section 2, an experimental spoken dialog system is described for call-routing plus subsequent automatic processing of information retrieval and form-filling functions. The dialog is based upon a feedback control model, where at each stage the user can provide both information plus feedback as to the appropriateness of the machine's response (Gorin, 1995a). In Section 3, a database is described of 10 K fluently spoken transactions between customers and human agents for this task. In particular, we describe the language variability in the first customer utterance, responding to the prompt of “How may I help you?” in a telecommunications environment.

In Section 4, we describe the spoken language understanding (SLU) algorithms which we exploit for call classification. A central notion in this work is that it is not necessary to recognize and understand every nuance of the speech, but only those fragments which are salient for the task (Gorin, 1995a). This leads to a methodology where understanding is based upon recognition of such salient fragments and combinations thereof.

There are three main components in our SLU methodology. First is to automatically acquire salient grammar fragments from the data, modeling those parts of the language which are meaningful for the task plus their statistical associations to the machine actions. Second is to recognize these fragments in fluent speech, searching the output of a large vocabulary speech recognizer. The statistical language model which constrains this recognizer embeds automatically-acquired fragments in a stochastic finite state machine, providing an efficient approximation to an n-gram model with variable length units (Riccardi et al., 1996). Third, we exploit these multiple recognized fragments to classify the call-type of an utterance. Since the SLU is embedded within a dialog system, the classifier provides both the best (rank 1) and secondary (rank 2, etc.) decisions. Finally, in Section 5, we report on experimental results for call-classification from the above-mentioned speech database, training on 8 K utterances and testing on 1 K.
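The third component above, ranked call-type classification from recognized fragments, can be illustrated with a minimal sketch. The fragment inventory, association weights, and call-type names below are invented for illustration and are not taken from the paper; the idea shown is only that fragment-to-call-type evidence is accumulated and the resulting call types are ranked so a dialog manager can fall back from the rank-1 to the rank-2 decision.

```python
from collections import defaultdict

def rank_call_types(recognized_fragments, associations):
    """Score each call type by summing the association weights of the
    salient fragments recognized in the utterance, then sort so that
    both the rank-1 and rank-2 decisions are available to the dialog."""
    scores = defaultdict(float)
    for frag in recognized_fragments:
        for call_type, weight in associations.get(frag, {}).items():
            scores[call_type] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented example data: fragment -> {call type: association strength}
associations = {
    "reverse the charges": {"COLLECT": 0.9},
    "dial direct": {"DIAL_FOR_ME": 0.4, "RATE": 0.2},
    "to Tokyo": {"RATE": 0.3, "DIAL_FOR_ME": 0.3},
}
ranked = rank_call_types(["reverse the charges"], associations)
```

In a dialog setting, a low rank-1 score or a small margin over rank 2 would trigger a clarification turn rather than an immediate route.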

A spoken dialog system

The goal of a call-router is to recognize and understand the user's speech sufficiently to determine the call-type. Dialog is necessary since, in many situations, the call type cannot be determined from a single input. This can be due to an ambiguous request or to imperfect performance of the spoken language understanding (SLU) algorithms.

One important component of dialog is confirmation, wherein the machine proposes its understanding of the user's input, receiving reinforcement feedback as to …

Database

In order to enable experimental evaluation, we generated a database of 10 K spoken transactions between customers and human agents. First, both channels of the dialog were recorded from the agents' headset jacks onto a digital audio tape (DAT). At the end of each transaction, a control key was manually depressed (by the human agent) to generate a DTMF code, serving both as a segmentation marker and a call-type label. These recordings were then automatically segmented, filtered and downsampled to …
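The dual role of the DTMF code, segment boundary plus label, can be sketched as follows. The particular codes and call-type names below are hypothetical; the excerpt does not give the actual coding scheme used in the study.

```python
# Hypothetical mapping from agent-entered DTMF codes to call-type labels;
# the real codes used in the study are not given in this excerpt.
DTMF_TO_CALL_TYPE = {"1": "COLLECT", "2": "THIRD_NUMBER", "3": "RATE"}

def label_segments(dtmf_events):
    """Each DTMF event marks the end of one transaction and carries its
    call-type label, so segment i receives the label of the i-th event."""
    return [DTMF_TO_CALL_TYPE.get(code, "OTHER") for code in dtmf_events]
```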

Algorithms

In this section, we describe the algorithms underlying this system and experiments. A key notion is that for any particular task, it is not necessary to recognize and understand every word and nuance in an utterance. That is, to extract semantic information from spoken language, it suffices to focus on the salient fragments and combinations thereof. There are three major issues that we address:

  • How do we acquire the salient grammar fragments for this task?

  • How can we recognize these fragments in …
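On the first question, one simple way to measure how strongly a candidate fragment is associated with the machine actions is to estimate the empirical posterior P(call-type | fragment) from the labeled transcriptions; a fragment is a salience candidate when this posterior is sharply peaked away from the prior over call types. The sketch below is a simplified illustration of that idea, not the paper's exact salience measure, and the toy data are invented.

```python
from collections import Counter

def fragment_associations(labeled_transcriptions, fragment):
    """Estimate P(call_type | fragment occurs) from (text, label) pairs.
    A sharply peaked posterior suggests the fragment is salient."""
    with_frag = [label for text, label in labeled_transcriptions
                 if fragment in text]
    counts = Counter(with_frag)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Toy labeled data (invented for illustration)
data = [
    ("can i reverse the charges on this call", "COLLECT"),
    ("i want to reverse the charges please", "COLLECT"),
    ("how do i dial direct to Tokyo", "DIAL_FOR_ME"),
]
posterior = fragment_associations(data, "reverse the charges")
```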

Experimental results

The database of Section 3 was divided into 8 K training and 1 K test utterances. The remainder of the 10 K database has been reserved for future validation experiments. Salient phrase fragments were automatically generated from the training transcriptions and associated call-types via the methods of Section 4.1. In particular, fragments were restricted to a length of four words or less and a training-set frequency of five or greater. An initial filtering was imposed so that the peak of the a posteriori …
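The stated restrictions (length at most four words, training-set frequency at least five, and a filter on the peak of the fragment's a posteriori call-type distribution) can be sketched as below. The peak threshold value of 0.5 is an assumed illustration, not the paper's setting.

```python
from collections import Counter

def candidate_fragments(token_lists, max_len=4, min_freq=5):
    """Enumerate all word n-grams up to max_len words and keep those
    occurring at least min_freq times in the training transcriptions."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {frag for frag, c in counts.items() if c >= min_freq}

def salient(fragment_posterior, peak_threshold=0.5):
    """Initial filter: keep a fragment only if the peak of its a
    posteriori call-type distribution exceeds a threshold
    (the 0.5 here is an assumption for illustration)."""
    return bool(fragment_posterior) and \
        max(fragment_posterior.values()) >= peak_threshold
```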

Conclusions

We have described progress towards a natural spoken dialog system for automated services. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. A first stage in this system is call-classification, i.e., routing a caller depending on the meaning of their fluently spoken response to “How may I help you?” We have proposed algorithms for automatically acquiring language models for both recognition and understanding, …

Acknowledgements

The authors wish to thank Larry Rabiner, Jay Wilpon, David Roe, Barry Parker and Jim Scherer for their support and encouragement of this research. We also thank Mike Riley and Andrej Ljolje for many hours of useful discussion on the ASR aspects of this effort. We finally thank our colleagues Alicia Abella, Tirso Alonso, Egbert Ammicht and Susan Boyce for their continued collaboration in the creation of a spoken dialog system exploiting the methods of this paper.

References

  • Ljolje, A., 1994. High accuracy phone recognition using context clustering and quasi-triphonic models. Comp. Speech Lang.
  • Matsumura, T., et al., 1995. Non-uniform unit based HMMs for continuous speech recognition. Speech Communication.
  • Riccardi, G., et al., 1996. Stochastic automata for language modeling. Comp. Speech Lang.
  • Abella, A., Brown, M., Buntschuh, B., 1996. Developing principles for dialog-based interfaces. In: Proc. ECAI Spoken...
  • Blachman, N.M., 1968. The amount of information that y gives about x. IEEE Trans. Inform. Theory.
  • Boyce, S., Gorin, A.L., 1996. User interface issues for natural spoken dialog systems. In: Proc. Internat. Symp. on...
  • Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley, New...
  • Garner, P.N., Hemsworth, A., 1997. A keyword selection strategy for dialog move recognition and multi-class topic...
  • Gertner, A.N., Gorin, A.L., 1993. Adaptive language acquisition for an airline information subsystem. In: Mammone, R....
  • Giachin, E., 1995. Phrase bigrams for continuous speech recognition. In: Proc. Internat. Conf. Acoust. Speech Signal...
  • Gorin, A.L., 1995a. On automated language acquisition. J. Acoust. Soc. Amer.
  • Gorin, A.L., 1995b. Spoken dialog as a feedback control system. In: Proc. ESCA Workshop on Spoken Dialog Systems,...
  • Gorin, A.L., 1996. Processing of semantic information in fluently spoken language. In: Proc. Internat. Conf. on Spoken...
  • Gorin, A.L., et al., 1994. An experiment in spoken language acquisition. IEEE Trans. Speech and Audio.
1. E-mail: [email protected].
2. E-mail: [email protected]. On leave of absence from the University of Bristol, UK.
