
2021 | Original Paper | Book Chapter

Deep AM-FM: Toolkit for Automatic Dialogue Evaluation

Authors: Chen Zhang, Luis Fernando D’Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li

Published in: Conversational Dialogue Systems for the Next Decade

Publisher: Springer Singapore

Abstract

Many studies have addressed human-machine dialogue systems. To evaluate them accurately and fairly, researchers often resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The AM-FM (Adequacy Metric - Fluency Metric) framework offers an automatic evaluation metric that correlates well with human judgements. It measures the quality of generated dialogue along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face technical limitations. The latent semantic indexing (LSI) approach to AM modeling does not scale to large amounts of data, and its bag-of-words sentence representation fails to capture contextual information. As for FM modeling, the n-gram language model cannot capture long-term dependencies. Deep learning approaches, such as the long short-term memory (LSTM) network and transformer-based architectures, address these issues well: they provide more context-aware sentence representations than LSI and achieve much lower perplexity on benchmark datasets than n-gram language models. In this paper, we propose deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation with human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task, compared to the original implementation and other popular automatic metrics.
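The two dimensions described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the authors' implementation: it assumes sentence embeddings and language-model log-probabilities have already been computed by some external model (e.g. a transformer encoder and a neural language model), the `min/max` normalization for fluency follows the general AM-FM convention of bounding the score to [0, 1], and the weight `alpha` is a hypothetical parameter.

```python
import math

def am_score(hyp_emb, ref_embs):
    # Adequacy: maximum cosine similarity between the hypothesis
    # embedding and each gold-reference embedding.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return max(cos(hyp_emb, r) for r in ref_embs)

def fm_score(hyp_logprob, ref_logprobs):
    # Fluency: compare the language-model probability of the hypothesis
    # against the best-scoring reference, bounded to [0, 1] via min/max.
    p_hyp = math.exp(hyp_logprob)
    p_ref = math.exp(max(ref_logprobs))
    return min(p_hyp, p_ref) / max(p_hyp, p_ref)

def am_fm(am, fm, alpha=0.5):
    # Final score: convex combination of adequacy and fluency.
    return alpha * am + (1 - alpha) * fm
```

A response identical to one of its references scores 1.0 on both dimensions; responses that drift semantically or are assigned much lower language-model probability than the references are penalized accordingly.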


Footnotes
3
R: reference, H: system response, j: system index, i: test case index, k: reference index.
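Using this notation, the per-response adequacy score can be sketched as follows (a sketch only: the embedding function $e(\cdot)$ is an assumption, standing in for whatever sentence encoder is used):

```latex
AM_{i,j} = \max_{k} \, \cos\big( e(H_{i,j}),\; e(R_{i,k}) \big)
```

That is, the response of system $j$ on test case $i$ is scored against its closest gold reference in the embedding space.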
 
Metadata
Title: Deep AM-FM: Toolkit for Automatic Dialogue Evaluation
Authors: Chen Zhang, Luis Fernando D’Haro, Rafael E. Banchs, Thomas Friedrichs, Haizhou Li
Copyright year: 2021
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-15-8395-7_5
