Published in: Minds and Machines 4/2020

04-11-2020 | General Article

Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too

Author: José Hernández-Orallo

Abstract

In the last 20 years the Turing test has been left further behind by new developments in artificial intelligence. At the same time, however, these developments have revived some key elements of the Turing test: imitation and adversarialness. On the one hand, many generative models, such as generative adversarial networks (GANs), build imitators in an adversarial setting that strongly resembles the Turing test (with the judge being a learnt discriminative model). The term “Turing learning” has been used for this kind of setting. On the other hand, AI benchmarks face an adversarial situation too, with a ‘challenge-solve-and-replace’ evaluation dynamic whenever human performance is ‘imitated’: the AI community concerned rushes to replace the old benchmark with a more challenging one, for which human performance is still beyond AI. These two phenomena related to the Turing test are sufficiently distinctive, important and general to merit a detailed analysis, which is the main goal of this paper. After recognising the abyss that appears beyond superhuman performance, we build on Turing learning to identify two different evaluation schemas: Turing testing and adversarial testing. We revisit some of the key questions surrounding the Turing test, such as ‘understanding’, commonsense reasoning and extracting meaning from the world, and explore how the new testing paradigms should work to unmask the limitations of current and future AI. Finally, we discuss how behavioural similarity metrics could be used to create taxonomies for artificial and natural intelligence. Both testing schemas should complete a transition in which humans give way to machines (not only as references to be imitated but also as judges) when pursuing and measuring machine intelligence.

Footnotes
4
This separation is well known in computer science, at least between solving and verifying. For instance, candidate solutions to NP problems can be verified easily (in polynomial time), but unless P=NP, solving the hardest of these problems is much harder than verifying them. For the “cognitive-judge problem” we must distinguish producing, solving and verifying instances, and realise that any of the three can be harder than the others.
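The gap between solving and verifying can be illustrated with subset sum, a standard NP-complete problem. The sketch below is only an illustration and is not taken from the paper: the function names `verify` and `solve` and the sample numbers are assumptions chosen for the example. Checking a proposed subset takes polynomial time, whereas the naive solver enumerates subsets and so takes exponential time in the worst case.

```python
from itertools import combinations


def verify(numbers, target, certificate):
    """Check a proposed subset in polynomial time."""
    pool = list(numbers)
    for x in certificate:
        # Each certificate element must be drawn from the multiset of numbers.
        if x not in pool:
            return False
        pool.remove(x)
    return sum(certificate) == target


def solve(numbers, target):
    """Brute-force search over all subsets (exponential in len(numbers))."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None


numbers = [3, 34, 4, 12, 5, 2]
certificate = solve(numbers, 9)
print(certificate, verify(numbers, 9, certificate))
```

Producing instances, the third activity mentioned in the footnote, is not modelled here.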
 
5
In some of the cases above, we are assuming that labelling requires human cognitive effort, such as the bird species example where a human must look at the images. But labelling could have been done in other ways, such as a DNA test.
 
6
In language models, ‘perplexity’ is a very common automatic metric, which basically measures how well the model anticipates the next words in a sentence and serves as a proxy for how well the model compresses the data. Compression has been connected with the Turing test and (machine) intelligence evaluation a few times (Dowe and Hajek 1997, 1998; Mahoney 1999; Dowe et al. 2011). Despite the correlation between perplexity and evaluation metrics that rely on human judges, the latter are still used as ground truth to evaluate conversational agents (see, e.g., Adiwardana et al. 2020).
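As a minimal sketch of the standard definition (not code from the paper), perplexity is the exponential of the average negative log-probability that the model assigns to the tokens actually observed. Its base-2 logarithm is the average number of bits per token that an ideal coder driven by the model would need, which is the link to compression. The function names and the toy probabilities below are assumptions for illustration.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)


def bits_per_token(token_probs):
    """Average code length (in bits) implied by the same probabilities."""
    return math.log2(perplexity(token_probs))


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among four options.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs), bits_per_token(probs))
```

For the toy input the perplexity is approximately 4 and the implied code length approximately 2 bits per token.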
 
8
Bongard problems are pattern recognition puzzles, where the diagrams on the left have something in common (e.g., only containing convex polygons) that the diagrams on the right do not (e.g., containing concavities). Correctly telling where a new diagram belongs (left or right) is assumed to reveal an understanding of the underlying concept.
 
9
The Copycat project explored systems that could solve analogies such as “abc is to abd as ijk is to what?”, where giving the right answer should reveal the understanding of the mechanism that generated the strings.
 
10
IQ tests usually include abstract questions with diagrams or numbers. For instance, “What’s the odd one out of 40, 3, 20 and 80?” assumes understanding of a common pattern shared by three elements but not by the fourth.
 
11
The C-test generated letter series using patterns whose algorithmic complexity and ‘unquestionability’ could be estimated from first principles. For instance, solving instances such as “Continue the series: abbcccdddde...” assumes understanding of the pattern that generates the series.
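One reading of the example series is that the k-th letter of the alphabet is repeated k times. The sketch below generates and continues the series under that assumed pattern. It does not reproduce how the C-test estimates algorithmic complexity or ‘unquestionability’, and the function names are hypothetical.

```python
from string import ascii_lowercase


def series(length):
    """First `length` characters of the series in which the k-th letter appears k times."""
    out = []
    for k, letter in enumerate(ascii_lowercase, start=1):
        out.extend(letter * k)
        if len(out) >= length:
            break
    # With 26 letters the series has at most 351 characters.
    return "".join(out[:length])


def continue_series(prefix, extra=1):
    """Continue a prefix, assuming it was produced by `series`."""
    full = series(len(prefix) + extra)
    if not full.startswith(prefix):
        raise ValueError("prefix does not follow the assumed pattern")
    return full[len(prefix):]


print(series(11))
print(continue_series("abbcccdddde", extra=4))
```

The first call reproduces the prefix quoted in the footnote, and the second returns the next four characters under the assumed pattern.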
 
12
ARC is also inspired by algorithmic information theory, but the actual instances resemble pixelated versions of the Bongard problems, where there is a pattern that converts some images into others by applying some algorithmic transformation (e.g., filling the closed areas in the image, mirroring an image, etc.). Finding the pattern should indicate understanding of how the transformation works.
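The idea of finding the transformation from a few input and output pairs can be sketched with a toy hypothesis space. Everything below is an illustrative assumption: real ARC tasks use a far richer set of transformations, and the grids, the three candidate functions and the `infer` helper are made up for the example.

```python
def identity(grid):
    """Return a copy of the grid unchanged."""
    return [list(row) for row in grid]


def mirror_lr(grid):
    """Mirror the grid left to right."""
    return [list(reversed(row)) for row in grid]


def flip_ud(grid):
    """Flip the grid top to bottom."""
    return [list(row) for row in reversed(grid)]


# Toy hypothesis space of candidate transformations.
CANDIDATES = {"identity": identity, "mirror_lr": mirror_lr, "flip_ud": flip_ud}


def infer(examples):
    """Names of the candidates consistent with every (input, output) pair."""
    return [name for name, f in CANDIDATES.items()
            if all(f(inp) == out for inp, out in examples)]


examples = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[2, 2, 0]], [[0, 2, 2]]),
]
print(infer(examples))
```

Only the left-to-right mirror is consistent with both example pairs, so it would be the transformation applied to a new test grid.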
 
14
This sonnet was also used by Turing in some of his examples about the imitation game (Turing 1950).
 
16
These judges may have a particular training and developmental process, as child machine judges.
 
Literature
Adiwardana, D., Luong, M. T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. (2020). Towards a human-like open-domain chatbot. arXiv:2001.09977.
Alvarado, N., Adams, S. S., Burbeck, S., & Latta, C. (2002). Beyond the Turing test: Performance metrics for evaluating a computer simulation of the human mind. In The 2nd international conference on development and learning, 2002 (pp. 147–152). IEEE.
Arel, I., & Livingston, S. (2009). Beyond the Turing test. Computer, 42(3), 90–91.
Armstrong, S., & Sotala, K. (2015). How we’re predicting AI–or failing to. In Beyond artificial intelligence (pp. 11–29). New York: Springer.
Arora, S., Ge, R., Liang, Y., Ma, T., & Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 224–232). JMLR.org.
Bhatnagar, S., et al. (2017). Mapping intelligence: Requirements and possibilities. In PTAI (pp. 117–135). New York: Springer.
Bongard, M. M. (1970). Pattern Recognition. New York: Spartan Books.
Borg, M., Johansen, S. S., Thomsen, D. L., & Kraus, M. (2012). Practical implementation of a graphics Turing test. In Advances in visual computing (pp. 305–313). New York: Springer.
Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford: Oxford University Press.
Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv:2005.14165.
Burkart, J. M., Schubiger, M. N., & van Schaik, C. P. (2017). The evolution of general intelligence. Behavioral and Brain Sciences, 40, e195.
Burr, C., & Cristianini, N. (2019). Can machines read our minds? Minds and Machines, 29(3), 461–494.
Campbell, M., Hoane, A. J., & Hsu, F. (2002). Deep Blue. Artificial Intelligence, 134(1–2), 57–83.
Cohen, P. R. (2005). If not Turing’s test, then what? AI Magazine, 26(4), 61.
Copeland, J., & Proudfoot, D. (2008). Turing’s test. A philosophical and historical guide. In R. Epstein, G. Roberts, & G. Beber (Eds.), Parsing the Turing Test. Philosophical and Methodological Issues in the Quest for the Thinking Computer. New York: Springer.
Crosby, M., Beyret, B., Shanahan, M., Hernandez-Orallo, J., Cheke, L., & Halina, M. (2020). The animal-AI testbed and competition. Proceedings of Machine Learning Research, 123, 164–176.
Crosby, M., Beyret, B., Hernandez-Orallo, J., Cheke, L., Halina, M., & Shanahan, M. (2019). Translating from animal cognition to AI. NeurIPS workshop on biological and artificial reinforcement learning.
Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9), 92–103.
Dennett, D. C. (1971). Intentional systems. The Journal of Philosophy, 68, 87–106.
Dodge, S., & Karam, L. (2017). A study and comparison of human and deep learning recognition performance under visual distortions. In ICCCN (pp. 1–7). IEEE.
Dowe, D. L., & Hernández-Orallo, J. (2012). IQ tests are not for machines, yet. Intelligence, 40(2), 77–81.
Dowe, D. L., & Hernández-Orallo, J. (2014). How universal can an intelligence test be? Adaptive Behavior, 22(1), 51–69.
Dowe, D. L., Hernández-Orallo, J., & Das, P. K. (2011). Compression and intelligence: Social environments and communication. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds.), Artificial general intelligence (Vol. 6830, pp. 204–211). LNAI series. New York: Springer.
Dowe, D. L., & Hajek, A. R. (1997). A computational extension to the Turing test. In Proceedings of the 4th Conference of the Australasian Cognitive Science Society, University of Newcastle, NSW, Australia. Also as Technical Report #97/322, Dept Computer Science, Monash University, Australia.
Dowe, D. L., & Hajek, A. R. (1998). A non-behavioural, computational extension to the Turing Test. In Intl. conf. on computational intelligence & multimedia applications (ICCIMA’98) (pp. 101–106). Gippsland, Australia.
Fabra-Boluda, R., Ferri, C., Martínez-Plumed, F., Hernández-Orallo, J., & Ramírez-Quintana, M. J. (2020). Family and prejudice: A behavioural taxonomy of machine learning techniques. In ECAI 2020: 24th European conference on artificial intelligence.
Flach, P. (2019). Performance evaluation in machine learning: The good, the bad, the ugly and the way forward. In AAAI.
Fostel, G. (1993). The Turing test is for the birds. ACM SIGART Bulletin, 4(1), 7–8.
French, R. M. (1990). Subcognition and the limits of the Turing test. Mind, 99(393), 53–65.
French, R. M. (2000). The Turing test: The first 50 years. Trends in Cognitive Sciences, 4(3), 115–122.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014a). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).
Groß, R., Gu, Y., Li, W., & Gauci, M. (2017). Generalizing GANs: A Turing perspective. In Advances in neural information processing systems (pp. 6316–6326).
Harnad, S. (1992). The Turing test is not a trick: Turing indistinguishability is a scientific criterion. ACM SIGART Bulletin, 3(4), 9–10.
Hayes, P., & Ford, K. (1995). Turing test considered harmful. In International joint conference on artificial intelligence (IJCAI) (pp. 972–977).
Hernandez-Orallo, J. (2015). Stochastic tasks: Difficulty and Levin search. In J. Bieger, B. Goertzel, & A. Potapov (Eds.), Artificial general intelligence: 8th international conference, AGI 2015, Berlin, Germany, July 22–25, 2015 (pp. 90–100). New York: Springer.
Hernández-Orallo, J. (2000). Beyond the Turing test. Journal of Logic, Language & Information, 9(4), 447–466.
Hernández-Orallo, J. (2001). On the computational measurement of intelligence factors (pp. 72–79). Gaithersburg: NIST Special Publication.
Hernández-Orallo, J. (2015). On environment difficulty and discriminating power. Autonomous Agents and Multi-Agent Systems, 29, 402–454.
Hernández-Orallo, J. (2017a). Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement. Artificial Intelligence Review, 48(3), 397–447.
Hernández-Orallo, J. (2017b). The measure of all minds: Evaluating natural and artificial intelligence. Cambridge: Cambridge University Press.
Hernández-Orallo, J. (2019a). Gazing into clever Hans machines. Nature Machine Intelligence, 1(4), 172–173.
Hernández-Orallo, J. (2019b). Unbridled mental power. Nature Physics, 15(1), 106.
Hernández-Orallo, J., & Dowe, D. L. (2010). Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence, 174(18), 1508–1539.
Hernández-Orallo, J., & Dowe, D. L. (2013). On potential cognitive abilities in the machine kingdom. Minds and Machines, 23(2), 179–210.
Hernández-Orallo, J., Dowe, D. L., España-Cubillo, S., Hernández-Lloreda, M. V., & Insa-Cabrera, J. (2011). On more realistic environment distributions for defining, evaluating and developing intelligence. In J. Schmidhuber, K. Thórisson, & M. Looks (Eds.), Artificial general intelligence (Vol. 6830, pp. 82–91). LNAI. New York: Springer.
Hernández-Orallo, J., Dowe, D. L., & Hernández-Lloreda, M. V. (2014). Universal psychometrics: Measuring cognitive abilities in the machine kingdom. Cognitive Systems Research, 27, 50–74.
Hernández-Orallo, J., Insa-Cabrera, J., Dowe, D. L., & Hibbard, B. (2012). Turing tests with Turing machines. Turing, 10, 140–156.
Hernández-Orallo, J., Martínez-Plumed, F., Schmid, U., Siebers, M., & Dowe, D. L. (2016). Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence, 230, 74–107.
Hernández-Orallo, J. (2015). C-tests revisited: Back and forth with complexity. In J. Bieger, B. Goertzel, & A. Potapov (Eds.), Artificial general intelligence: 8th international conference, AGI 2015, Berlin, Germany, July 22–25, 2015 (pp. 272–282). New York: Springer.
Hernández-Orallo, J. (2020). AI evaluation: On broken yardsticks and measurement scales. Evaluating AI Evaluation @ AAAI.
Hernández-Orallo, J., & Minaya-Collado, N. (1998). A formal definition of intelligence based on an intensional variant of Kolmogorov complexity. In Proc. intl symposium of engineering of intelligent systems (EIS’98) (pp. 146–163). ICSC Press.
Hernández-Orallo, J., & Vold, K. (2019). AI extenders: The ethical and societal implications of humans cognitively extended by AI. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (pp. 507–513).
Hernández-Orallo, J., Insa-Cabrera, J., Dowe, D. L., & Hibbard, B. (2012). Turing machines and recursive Turing Tests. In V. Muller, & A. Ayesh (Eds.), AISB/IACAP 2012 Symposium “Revisiting Turing and his Test”, The Society for the Study of Artificial Intelligence and the Simulation of Behaviour (pp. 28–33).
Hernández-Orallo, J., Baroni, M., Bieger, J., Chmait, N., Dowe, D. L., Hofmann, K., et al. (2017). A new AI evaluation cosmos: Ready to play the game? AI Magazine, 38(3), Fall 2017.
Hibbard, B. (2008). Adversarial sequence prediction. Frontiers in Artificial Intelligence and Applications, 171, 399.
Hibbard, B. (2011). Measuring agent intelligence via hierarchies of environments. In Artificial general intelligence (pp. 303–308). New York: Springer.
Hingston, P. (2009). The 2k botprize. In IEEE symposium on computational intelligence and games (CIG 2009) (pp. 1–1). IEEE.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In Advances in neural information processing systems (pp. 3–10).
Hofstadter, D. R. (1980). Gödel, Escher, Bach. New York: Vintage Books.
Hofstadter, D. R., & Mitchell, M. (1994). The Copycat project: A model of mental fluidity and analogy-making. Norwood, NJ: Ablex Publishing.
Insa-Cabrera, J., Dowe, D. L., España-Cubillo, S., Hernández-Lloreda, M. V., & Hernández-Orallo, J. (2011a). Comparing humans and AI agents. In International conference on artificial general intelligence (pp. 122–132). New York: Springer.
Insa-Cabrera, J., Dowe, D. L., & Hernández-Orallo, J. (2011b). Evaluating a reinforcement learning algorithm with a general intelligence test. In J. Lozano, J. Gamez, & J. Moreno (Eds.), Current topics in artificial intelligence (CAEPIA 2011). LNAI Series 7023. New York: Springer.
Jiang, Z., Xu, F. F., Araki, J., & Neubig, G. (2020b). How can we know what language models know? Transactions of the Association for Computational Linguistics, 8, 423–438.
Jiang, M., Luketina, J., Nardelli, N., Minervini, P., Torr, P. H., Whiteson, S., & Rocktäschel, T. (2020a). Wordcraft: An environment for benchmarking commonsense agents. arXiv:2007.09185.
Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., & Aila, T. (2019). Improved precision and recall metric for assessing generative models. arXiv:1904.06991.
Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4), 391–444.
Levesque, H. J. (2017). Common sense, the Turing test, and the quest for real AI. New York: MIT Press.
Levesque, H., Davis, E., & Morgenstern, L. (2012). The Winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning.
Li, W., Gauci, M., & Groß, R. (2016). Turing learning: A metric-free approach to inferring behavior and its application to swarms. Swarm Intelligence, 10(3), 211–243.
Li, W., Gauci, M., & Groß, R. (2013). A coevolutionary approach to learn animal behavior through controlled interaction. In Proceedings of the 15th annual conference on Genetic and evolutionary computation (pp. 223–230).
van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33(1), 5–20.
Mahoney, M. V. (1999). Text compression as a test for artificial intelligence. In Proceedings of the national conference on artificial intelligence (pp. 970–970). AAAI.
Marcus, G., Rossi, F., & Veloso, M. (2016). Beyond the Turing test (special issue). AI Magazine, 37(1), 3–101.
Martinez-Plumed, F., & Hernandez-Orallo, J. (2018). Dual indicators to analyse AI benchmarks: Difficulty, discrimination, ability and generality. IEEE Transactions on Games, 12, 121–131.
Martínez-Plumed, F., Prudêncio, R. B., Martínez-Usó, A., & Hernández-Orallo, J. (2019). Item response theory in AI: Analysing machine learning classifiers at the instance level. Artificial Intelligence, 271, 18–42.
Martínez-Plumed, F., Gomez, E., & Hernández-Orallo, J. (2020). Tracking AI: The capability is (not) near. In European conference on artificial intelligence.
Masum, H., Christensen, S., & Oppacher, F. (2002). The Turing ratio: Metrics for open-ended tasks. In Conf. on genetic and evolutionary computation (pp. 973–980). Morgan Kaufmann.
go back to reference McCarthy, J. (1983). Artificial intelligence needs more emphasis on basic research: President’s quarterly message. AI Magazine, 4(4), 5. McCarthy, J. (1983). Artificial intelligence needs more emphasis on basic research: President’s quarterly message. AI Magazine, 4(4), 5.
go back to reference McDermott, D. (2007). Level-headed. Artificial Intelligence, 171(18), 1183–1186. McDermott, D. (2007). Level-headed. Artificial Intelligence, 171(18), 1183–1186.
go back to reference Mishra, A., Bhattacharyya, P., & Carl, M. (2013). Automatically predicting sentence translation difficulty. In ACL (pp 346–351). Mishra, A., Bhattacharyya, P., & Carl, M. (2013). Automatically predicting sentence translation difficulty. In ACL (pp 346–351).
go back to reference Mitchell, M. (2019). Artificial intelligence: A guide for thinking humans. UK: Penguin. Mitchell, M. (2019). Artificial intelligence: A guide for thinking humans. UK: Penguin.
go back to reference Moor, J. (2003). The Turing test: the elusive standard of artificial intelligence (Vol. 30). New York: Springer Science & Business Media.MATH Moor, J. (2003). The Turing test: the elusive standard of artificial intelligence (Vol. 30). New York: Springer Science & Business Media.MATH
go back to reference Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2019). Adversarial nli: A new benchmark for natural language understanding. arXiv:191014599. Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2019). Adversarial nli: A new benchmark for natural language understanding. arXiv:​191014599.
go back to reference Nilsson, N. J. (2006). Human-level artificial intelligence? Be serious!. AI Magazine, 26(4), 68. Nilsson, N. J. (2006). Human-level artificial intelligence? Be serious!. AI Magazine, 26(4), 68.
go back to reference Preston, B. (1991). AI, anthropocentrism, and the evolution of ‘intelligence’. Minds and Machines, 1(3), 259–277. Preston, B. (1991). AI, anthropocentrism, and the evolution of ‘intelligence’. Minds and Machines, 1(3), 259–277.
go back to reference Proudfoot, D. (2011). Anthropomorphism and AI: Turing’s much misunderstood imitation game. Artificial Intelligence, 175(5), 950–957.MathSciNet Proudfoot, D. (2011). Anthropomorphism and AI: Turing’s much misunderstood imitation game. Artificial Intelligence, 175(5), 950–957.MathSciNet
go back to reference Proudfoot, D. (2017). The Turing test-from every angle. In J. Bowen, M. Sprevak, R. Wilson, & B. J. Copeland (Eds.), The Turing Guide. Oxford: Oxford University Press. Proudfoot, D. (2017). The Turing test-from every angle. In J. Bowen, M. Sprevak, R. Wilson, & B. J. Copeland (Eds.), The Turing Guide. Oxford: Oxford University Press.
go back to reference Rahwan, I., Cebrian, M., Obradovich, N., Bongard, J., Bonnefon, J. F., Breazeal, C., et al. (2019). Machine behaviour. Nature, 568(7753), 477–486. Rahwan, I., Cebrian, M., Obradovich, N., Bongard, J., Bonnefon, J. F., Breazeal, C., et al. (2019). Machine behaviour. Nature, 568(7753), 477–486.
go back to reference Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33), 7255–7269. Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33), 7255–7269.
go back to reference Rozen, O., Shwartz, V., Aharoni, R., & Dagan, I. (2019). Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets. In Proceedings of the 23rd conference on computational natural language learning (CoNLL), Association for Computational Linguistics, Hong Kong, China, pp. 196–205. Rozen, O., Shwartz, V., Aharoni, R., & Dagan, I. (2019). Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets. In Proceedings of the 23rd conference on computational natural language learning (CoNLL), Association for Computational Linguistics, Hong Kong, China, pp. 196–205.
go back to reference Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.MathSciNet Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.MathSciNet
go back to reference Sakaguchi, K., Bras, R. L., Bhagavatula, C., & Choi, Y. (2019). Winogrande: An adversarial winograd schema challenge at scale. arXiv:190710641. Sakaguchi, K., Bras, R. L., Bhagavatula, C., & Choi, Y. (2019). Winogrande: An adversarial winograd schema challenge at scale. arXiv:​190710641.
go back to reference Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.MathSciNet Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.MathSciNet
go back to reference Saygin, A. P., Cicekli, I., & Akman, V. (2000). Turing test: 50 years later. Minds and Machines, 10(4), 463–518. Saygin, A. P., Cicekli, I., & Akman, V. (2000). Turing test: 50 years later. Minds and Machines, 10(4), 463–518.
go back to reference Schlangen, D. (2019). Language tasks and language games: On methodology in current natural language processing research. arXiv:190810747. Schlangen, D. (2019). Language tasks and language games: On methodology in current natural language processing research. arXiv:​190810747.
go back to reference Schoenick, C., Clark, P., Tafjord, O., Turney, P., & Etzioni, O. (2017). Moving beyond the Turing test with the Allen AI science challenge. Communications of the ACM, 60(9), 60–64. Schoenick, C., Clark, P., Tafjord, O., Turney, P., & Etzioni, O. (2017). Moving beyond the Turing test with the Allen AI science challenge. Communications of the ACM, 60(9), 60–64.
go back to reference Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., Kar, K., Bashivan, P., Prescott-Roy, J., Schmidt, K., Yamins, D. L. K., & DiCarlo, J. J. (2018). Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv preprint. Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., Kar, K., Bashivan, P., Prescott-Roy, J., Schmidt, K., Yamins, D. L. K., & DiCarlo, J. J. (2018). Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv preprint.
go back to reference Schweizer, P. (1998). The truly total Turing test. Minds and Machines, 8(2), 263–272. Schweizer, P. (1998). The truly total Turing test. Minds and Machines, 8(2), 263–272.
go back to reference Sebeok, T. A., & Rosenthal, R. E. (1981). The clever Hans phenomenon: Communication with horses, whales, apes, and people. Annals of the NY Academy of Sciences, 364, 1–17. Sebeok, T. A., & Rosenthal, R. E. (1981). The clever Hans phenomenon: Communication with horses, whales, apes, and people. Annals of the NY Academy of Sciences, 364, 1–17.
go back to reference Seber, G. A. F., & Salehi, M. M. (2013). Adaptive cluster sampling. In Adaptive sampling designs (pp 11–26). New York: Springer. Seber, G. A. F., & Salehi, M. M. (2013). Adaptive cluster sampling. In Adaptive sampling designs (pp 11–26). New York: Springer.
go back to reference Settles, B. (2009). Active learning. Tech. rep., synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool. Settles, B. (2009). Active learning. Tech. rep., synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool.
go back to reference Shah, H., & Warwick, K. (2015). Human or machine? Communications of the ACM, 58(4), 8. Shah, H., & Warwick, K. (2015). Human or machine? Communications of the ACM, 58(4), 8.
go back to reference Shanahan, M. (2015). The technological singularity. New York: MIT Press. Shanahan, M. (2015). The technological singularity. New York: MIT Press.
go back to reference Shoham, Y. (2017). Towards the AI index. AI Magazine, 38(4), 71–77. Shoham, Y. (2017). Towards the AI index. AI Magazine, 38(4), 71–77.
go back to reference Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017b). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017b). Mastering the game of Go without human knowledge. Nature, 550(7676), 354–359.
go back to reference Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel T, et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:171201815. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel T, et al. (2017a). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:​171201815.
go back to reference Stern, R., Sturtevant, N., Felner, A., Koenig, S, et al. (2019). Multi-agent pathfinding: Definitions, variants, and benchmarks. arXiv:190608291. Stern, R., Sturtevant, N., Felner, A., Koenig, S, et al. (2019). Multi-agent pathfinding: Definitions, variants, and benchmarks. arXiv:​190608291.
go back to reference Sturm, B. L. (2014). A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 16(6), 1636–1644. Sturm, B. L. (2014). A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 16(6), 1636–1644.
go back to reference Turing, A. (1952). Can automatic calculating machines be said to think? BBC. BBC Third Programme, 14 and 23 Jan. 1952, between M. H. A. Newman, A. M. T., Sir Geoffrey Jefferson and R. B. Braithwaite. Reprinted in Copeland, B. J. (ed.) The essential Turing(pp. 494–495). Oxford: Oxford University Press. http://www.turingarchive.org/browse.php/B/6. Turing, A. (1952). Can automatic calculating machines be said to think? BBC. BBC Third Programme, 14 and 23 Jan. 1952, between M. H. A. Newman, A. M. T., Sir Geoffrey Jefferson and R. B. Braithwaite. Reprinted in Copeland, B. J. (ed.) The essential Turing(pp. 494–495). Oxford: Oxford University Press. http://​www.​turingarchive.​org/​browse.​php/​B/​6.
go back to reference Vale, C. D., & Weiss, D. J. (1975). A study of computer-administered stradaptive ability testing. Tech. rep., Minnesota Univ. Minneapolis Dept. of Psychology. Vale, C. D., & Weiss, D. J. (1975). A study of computer-administered stradaptive ability testing. Tech. rep., Minnesota Univ. Minneapolis Dept. of Psychology.
go back to reference Van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., & Tsang, J. (2017). Hybrid reward architecture for reinforcement learning. In NIPS (pp. 5392–5402). Van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., & Tsang, J. (2017). Hybrid reward architecture for reinforcement learning. In NIPS (pp. 5392–5402).
go back to reference Vardi, M. Y. (2015). Human or machine? Response. Communications of the ACM, 58(4), 8–8. Vardi, M. Y. (2015). Human or machine? Response. Communications of the ACM, 58(4), 8–8.
go back to reference Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). Starcraft ii: A new challenge for reinforcement learning. arXiv:170804782. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. (2017). Starcraft ii: A new challenge for reinforcement learning. arXiv:​170804782.
go back to reference von Ahn, L., Blum, M., & Langford, J. (2004). Telling humans and computers apart automatically. Communications of the ACM, 47(2), 56–60. von Ahn, L., Blum, M., & Langford, J. (2004). Telling humans and computers apart automatically. Communications of the ACM, 47(2), 56–60.
go back to reference von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). RECAPTCHA: Human-based character recognition via web security measures. Science, 321(5895), 1465.MathSciNetMATH von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). RECAPTCHA: Human-based character recognition via web security measures. Science, 321(5895), 1465.MathSciNetMATH
go back to reference Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlabaum Associate Publishers. Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Mahwah, NJ: Lawrence Erlabaum Associate Publishers.
go back to reference Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv:190500537. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv:​190500537.
go back to reference Watt, S. (1996). Naive psychology and the inverted Turing test. Psycoloquy, 7(14), 463–518. Watt, S. (1996). Naive psychology and the inverted Turing test. Psycoloquy, 7(14), 463–518.
go back to reference Weiss, D. J. (2011). Better data from better measurements using computerized adaptive testing. Journal of Methods and Measurement in the Social Sciences, 2(1), 1–27. Weiss, D. J. (2011). Better data from better measurements using computerized adaptive testing. Journal of Methods and Measurement in the Social Sciences, 2(1), 1–27.
go back to reference You, J. (2015). Beyond the Turing test. Science, 347(6218), 116–116. You, J. (2015). Beyond the Turing test. Science, 347(6218), 116–116.
go back to reference Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112(4), 1036–1040. Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112(4), 1036–1040.
go back to reference Zadeh, L. A. (2008). Toward human level machine intelligence-Is it achievable? The need for a paradigm shift. IEEE Computational Intelligence Magazine, 3(3), 11–22. Zadeh, L. A. (2008). Toward human level machine intelligence-Is it achievable? The need for a paradigm shift. IEEE Computational Intelligence Magazine, 3(3), 11–22.
go back to reference Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 conference on empirical methods in natural language processing (EMNLP). Zellers, R., Bisk, Y., Schwartz, R., & Choi, Y. (2018). Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 conference on empirical methods in natural language processing (EMNLP).
go back to reference Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence? arXiv:190507830. Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). Hellaswag: Can a machine really finish your sentence? arXiv:​190507830.
go back to reference Zhou, P., Khanna, R., Lin, B. Y., Ho, D., Ren, X., & Pujara, J. (2020). Can BERT reason? logically equivalent probes for evaluating the inference capabilities of language models. arXiv:200500782. Zhou, P., Khanna, R., Lin, B. Y., Ho, D., Ren, X., & Pujara, J. (2020). Can BERT reason? logically equivalent probes for evaluating the inference capabilities of language models. arXiv:​200500782.
go back to reference Zillich, M. (2012). My robot is smarter than your robot. on the need for a total Turing test for robots. In: V. Muller & A. Ayesh (Eds.), AISB/IACAP 2012 symposium “revisiting turing and his test”, The Society for the Study of Artificial Intelligence and the Simulation of Behaviour, pp. 12–15. Zillich, M. (2012). My robot is smarter than your robot. on the need for a total Turing test for robots. In: V. Muller & A. Ayesh (Eds.), AISB/IACAP 2012 symposium “revisiting turing and his test”, The Society for the Study of Artificial Intelligence and the Simulation of Behaviour, pp. 12–15.
Metadata
Title
Twenty Years Beyond the Turing Test: Moving Beyond the Human Judges Too
Author
José Hernández-Orallo
Publication date
04-11-2020
Publisher
Springer Netherlands
Published in
Minds and Machines / Issue 4/2020
Print ISSN: 0924-6495
Electronic ISSN: 1572-8641
DOI
https://doi.org/10.1007/s11023-020-09549-0
