
Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement


Abstract

The evaluation of artificial intelligence systems and components is crucial for the progress of the discipline. In this paper we describe and critically assess the different ways AI systems are evaluated, and the role of components and techniques in these systems. We first focus on the traditional task-oriented evaluation approach. We identify three kinds of evaluation: human discrimination, problem benchmarks and peer confrontation. We describe some of the limitations of the many evaluation schemes and competitions in these three categories, and follow the progression of some of these tests. We then focus on a less customary (and challenging) ability-oriented evaluation approach, where a system is characterised by its (cognitive) abilities, rather than by the tasks it is designed to solve. We discuss several possibilities: the adaptation of cognitive tests used for humans and animals, the development of tests derived from algorithmic information theory or more integrated approaches under the perspective of universal psychometrics. We analyse some evaluation tests from AI that are better positioned for an ability-oriented evaluation and discuss how their problems and limitations can possibly be addressed with some of the tools and ideas that appear within the paper. Finally, we enumerate a series of lessons learnt and generic guidelines to be used when an AI evaluation scheme is under consideration.


Notes

  1. Worst-case performance and best-case performance are special cases of a rank-based aggregation (using the cumulative distribution of results), with other possibilities such as the median, the first decile, etc. Rank-based aggregation, especially worst-case performance, is more robust to systems that get good scores on many easy problems but do poorly on the difficult ones.
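
     To make this concrete, here is a minimal sketch (in Python, with invented per-problem scores) of rank-based aggregation: the scores are treated as an empirical distribution and the aggregate is read off as a quantile, so worst case, first decile, median and best case are all the same operation with different quantile values.

     ```python
     import numpy as np

     def rank_based_aggregate(scores, quantile):
         """Aggregate per-problem scores as a quantile of their empirical
         distribution: 0.0 gives worst-case performance, 0.1 the first
         decile, 0.5 the median, and 1.0 best-case performance."""
         return float(np.quantile(np.asarray(scores), quantile))

     # Hypothetical system: strong on nine easy problems, weak on one hard one.
     scores = [0.91, 0.88, 0.95, 0.40, 0.87, 0.93, 0.90, 0.89, 0.92, 0.86]

     print(np.mean(scores))                    # 0.851: the average hides the failure
     print(rank_based_aggregate(scores, 0.5))  # 0.895: the median is also forgiving
     print(rank_based_aggregate(scores, 0.0))  # 0.40: worst case exposes it
     ```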

  2. Note that this formula does not have the size of the instance as a parameter, and hence it is not comparable to the usual view of worst-case analysis of algorithms.

  3. The distinction between white and black box can be enriched to consider those problems where the solution must be accompanied by a verification, proof or explanation (Hernández-Orallo 2000b; Alpcan et al. 2014).

  4. Note that it is not uncommon, as we will see, for the set of problems from M to be chosen by the research team that is evaluating its own method, so the probability of choosing from M can be biased in such a way that the result is actually a best-case evaluation.
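
     As a hedged illustration of this selection bias (the problem class M, the method and all scores are invented for the example), the following sketch compares sampling problems uniformly from M with letting the evaluating team showcase only the problems its method handles best:

     ```python
     import random

     random.seed(0)

     # Hypothetical scores of a method over the full problem class M.
     M_scores = [random.gauss(0.6, 0.15) for _ in range(200)]

     # Unbiased evaluation: problems drawn uniformly from M.
     uniform_sample = random.sample(M_scores, 20)

     # Biased evaluation: the team reports the 20 problems it does best on.
     cherry_picked = sorted(M_scores, reverse=True)[:20]

     mean = lambda xs: sum(xs) / len(xs)
     print(f"uniform sample: {mean(uniform_sample):.2f}")  # near the true average
     print(f"cherry-picked : {mean(cherry_picked):.2f}")   # close to best case
     ```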

  5. http://www.loebner.net/Prizef/loebner-prize.html.

  6. http://openml.org/.

  7. http://www.kdd.org/kdd-cup.

  8. http://www.ecmlpkdd2015.org/discovery-challenges.

  9. http://www.kaggle.com.

  10. http://www.rl-competition.org/.

  11. http://www.arcadelearningenvironment.org/.

  12. http://www.gvgai.net/.

  13. http://archive.darpa.mil/grandchallenge/.

  14. http://www.cybergrandchallenge.com/.

  15. http://www.statmt.org/europarl/, http://www.statmt.org/setimes/, http://matrix.statmt.org/matrix/info.

  16. http://www.nist.gov/itl/iad/mig/openmt.cfm.

  17. http://www.imageclef.org/.

  18. Statistical tests are not used to determine when a contestant can be said to be significantly better than another.
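
     For example (a sketch only: the match counts are invented, and the sign test is just one reasonable choice among paired tests), a two-sided binomial sign test on head-to-head results would let organisers state whether one contestant is significantly better than another:

     ```python
     from scipy.stats import binomtest

     # Hypothetical head-to-head record: contestant A won 14 of 20
     # decisive matches against contestant B (draws excluded).
     wins_a, decisive = 14, 20

     # Two-sided sign test of the null hypothesis that each contestant
     # is equally likely to win any given match.
     result = binomtest(wins_a, decisive, p=0.5, alternative="two-sided")
     print(f"p-value = {result.pvalue:.3f}")  # ~0.115: not significant at 0.05
     ```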

  19. http://www.icga.org/.

  20. http://games.stanford.edu/.

  21. http://www.robocup.org/.

  22. http://gaips.inesc-id.pt/geometryfriends/.

  23. http://www.chalearn.org/.

  24. http://www.nist.gov/tac/.

  25. http://trec.nist.gov/.

  26. http://commonsensereasoning.org.

  27. www.human-competitive.org.

  28. http://pebl.sourceforge.net/battery.html.

  29. The Chinese room argument was introduced by Searle (1980) to argue against the possibility of a machine having a mind. It compares a computer processing inputs and outputs as symbols with a person who knows no Chinese, placed in a room, who receives messages in Chinese and must answer them, also in Chinese, using a series of books that map inputs to outputs. Given the relevance of machine learning in AI nowadays, among other things, the argument has largely faded today.

  30. https://www.microsoft.com/en-us/research/project/project-malmo/.

  31. https://gym.openai.com/.

  32. http://www.inductive-programming.org/repository.html.

  33. http://www.robot.uji.es/EURON/en/index.htm, http://rockinrobotchallenge.eu/Benchmarking_Robotics.

  34. http://bicasociety.org/cogarch/architectures.htm.

  35. http://cogarch.org/index.php/Capabilities.

  36. http://users.dsic.upv.es/~flip/EGPAI2016/.

  37. http://openml.org/.

References

  • Abel D, Agarwal A, Diaz F, Krishnamurthy A, Schapire RE (2016) Exploratory gradient boosting for reinforcement learning in complex domains. arXiv preprint arXiv:1603.04119

  • Adams S, Arel I, Bach J, Coop R, Furlan R, Goertzel B, Hall JS, Samsonovich A, Scheutz M, Schlesinger M, Shapiro SC, Sowa J (2012) Mapping the landscape of human-level artificial general intelligence. AI Mag 33(1):25–42


  • Adams SS, Banavar G, Campbell M (2016) I-athlon: towards a multi-dimensional Turing test. AI Mag 37(1):78–84


  • Alcalá J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2010) Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Logic Soft Comput 17:255–287


  • Alexander JRM, Smales S (1997) Intelligence, learning and long-term memory. Personal Individ Differ 23(5):815–825


  • Alpcan T, Everitt T, Hutter M (2014) Can we measure the difficulty of an optimization problem? In: IEEE information theory workshop (ITW)

  • Alur R, Bodik R, Juniwal G, Martin MMK, Raghothaman M, Seshia SA, Singh R, Solar-Lezama A, Torlak E, Udupa A (2013) Syntax-guided synthesis. In: Formal methods in computer-aided design (FMCAD), 2013, IEEE, pp 1–17

  • Alvarado N, Adams SS, Burbeck S, Latta C (2002) Beyond the Turing test: performance metrics for evaluating a computer simulation of the human mind. In: Proceedings of the 2nd international conference on development and learning, IEEE, pp 147–152

  • Amigoni F, Bastianelli E, Berghofer J, Bonarini A, Fontana G, Hochgeschwender N, Iocchi L, Kraetzschmar G, Lima P, Matteucci M, Miraldo P, Nardi D, Schiaffonati V (2015) Competitions for benchmarking: task and functionality scoring complete performance assessment. IEEE Robot Autom Mag 22(3):53–61


  • Anderson J, Lebiere C (2003) The Newell test for a theory of cognition. Behav Brain Sci 26(5):587–601


  • Anderson J, Baltes J, Cheng CT (2011) Robotics competitions as benchmarks for AI research. Knowl Eng Rev 26(01):11–17


  • Arel I, Rose DC, Karnowski TP (2010) Deep machine learning—a new frontier in artificial intelligence research. IEEE Comput Intell Mag 5(4):13–18


  • Asada M, Hosoda K, Kuniyoshi Y, Ishiguro H, Inui T, Yoshikawa Y, Ogino M, Yoshida C (2009) Cognitive developmental robotics: a survey. IEEE Trans Auton Ment Dev 1(1):12–34


  • Aziz H, Brill M, Fischer F, Harrenstein P, Lang J, Seedig HG (2015) Possible and necessary winners of partial tournaments. J Artif Intell Res 54:493–534


  • Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Bagnall AJ, Zatuchna ZV (2005) On the classification of maze problems. In: Bull L, Kovacs T (eds) Foundations of learning classifier system. Studies in fuzziness and soft computing, vol. 183, Springer, pp 305–316. http://rd.springer.com/chapter/10.1007/11319122_12

  • Baldwin D, Yadav SB (1995) The process of research investigations in artificial intelligence - a unified view. IEEE Trans Syst Man Cybern 25(5):852–861


  • Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res 47:253–279


  • Besold TR (2014) A note on chances and limitations of psychometric AI. In: KI 2014: advances in artificial intelligence. Springer, pp 49–54

  • Biever C (2011) Ultimate IQ: one test to rule them all. New Sci 211(2829, 10 September 2011):42–45


  • Borg M, Johansen SS, Thomsen DL, Kraus M (2012) Practical implementation of a graphics Turing test. In: Advances in visual computing. Springer, pp 305–313

  • Boring EG (1923) Intelligence as the tests test it. New Repub 35–37

  • Bostrom N (2014) Superintelligence: paths, dangers, strategies. Oxford University Press, Oxford


  • Brazdil P, Carrier CG, Soares C, Vilalta R (2008) Metalearning: applications to data mining. Springer, New York


  • Bringsjord S (2011) Psychometric artificial intelligence. J Exp Theor Artif Intell 23(3):271–277


  • Bringsjord S, Schimanski B (2003) What is artificial intelligence? Psychometric AI as an answer. In: International joint conference on artificial intelligence, pp 887–893

  • Brundage M (2016) Modeling progress in AI. In: AAAI 2016 workshop on AI, ethics, and society

  • Buchanan BG (1988) Artificial intelligence as an experimental science. Springer, New York


  • Buhrmester M, Kwang T, Gosling SD (2011) Amazon’s mechanical turk a new source of inexpensive, yet high-quality, data? Perspect Psychol Sci 6(1):3–5


  • Bursztein E, Aigrain J, Moscicki A, Mitchell JC (2014) The end is nigh: generic solving of text-based captchas. In: Proceedings of the 8th USENIX conference on Offensive Technologies, USENIX Association, p 3

  • Campbell M, Hoane AJ, Hsu F (2002) Deep Blue. Artif Intell 134(1–2):57–83


  • Cangelosi A, Schlesinger M, Smith LB (2015) Developmental robotics: from babies to robots. MIT Press, Cambridge


  • Caputo B, Müller H, Martinez-Gomez J, Villegas M, Acar B, Patricia N, Marvasti N, Üsküdarlı S, Paredes R, Cazorla M et al (2014) ImageCLEF 2014: overview and analysis of the results. In: Information access evaluation. Multilinguality, multimodality, and interaction, Springer, pp 192–211

  • Carlson A, Betteridge J, Kisiel B, Settles B, Hruschka ER Jr, Mitchell TM (2010) Toward an architecture for never-ending language learning. In: AAAI, vol 5, p 3

  • Carroll JB (1993) Human cognitive abilities: a survey of factor-analytic studies. Cambridge University Press, Cambridge


  • Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75


  • Chaitin GJ (1982) Gödel’s theorem and information. Int J Theor Phys 21(12):941–954


  • Chandrasekaran B (1990) What kind of information processing is intelligence? In: The foundation of artificial intelligence—a sourcebook. Cambridge University Press, pp 14–46

  • Chater N (1999) The search for simplicity: a fundamental cognitive principle? Q J Exp Psychol Sect A 52(2):273–302


  • Chater N, Vitányi P (2003) Simplicity: a unifying principle in cognitive science? Trends Cogn Sci 7(1):19–22


  • Chu Z, Gianvecchio S, Wang H, Jajodia S (2010) Who is tweeting on twitter: human, bot, or cyborg? In: Proceedings of the 26th annual computer security applications conference, ACM, pp 21–30

  • Cochran WG (2007) Sampling techniques. Wiley, New York


  • Cohen PR, Howe AE (1988) How evaluation guides AI research: the message still counts more than the medium. AI Mag 9(4):35


  • Cohen Y (2013) Testing and cognitive enhancement. Technical report, National Institute for Testing and Evaluation, Jerusalem, Israel

  • Conrad JG, Zeleznikow J (2013) The significance of evaluation in AI and law: a case study re-examining ICAIL proceedings. In: Proceedings of the 14th international conference on artificial intelligence and law, ACM, pp 186–191

  • Conrad JG, Zeleznikow J (2015) The role of evaluation in AI and law. In: Proceedings of the 15th international conference on artificial intelligence and law, pp 181–186

  • Deary IJ, Der G, Ford G (2001) Reaction times and intelligence differences: a population-based cohort study. Intelligence 29(5):389–399


  • Decker KS, Durfee EH, Lesser VR (1989) Evaluating research in cooperative distributed problem solving. Distrib Artif Intell 2:487–519


  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30


  • Detterman DK (2011) A challenge to Watson. Intelligence 39(2–3):77–78


  • Dimitrakakis C (2016) Personal communication

  • Dimitrakakis C, Li G, Tziortziotis N (2014) The reinforcement learning competition 2014. AI Mag 35(3):61–65


  • Dowe DL (2013) Introduction to Ray Solomonoff 85th memorial conference. In: Dowe DL (ed) Algorithmic probability and friends. Bayesian prediction and artificial intelligence, lecture notes in computer science, vol 7070. Springer, Berlin, pp 1–36


  • Dowe DL, Hajek AR (1997) A computational extension to the Turing Test. In: Proceedings of the 4th conference of the Australasian cognitive science society, University of Newcastle, NSW, Australia

  • Dowe DL, Hajek AR (1998) A non-behavioural, computational extension to the Turing test. In: International conference on computational intelligence and multimedia applications (ICCIMA’98), Gippsland, Australia, pp 101–106

  • Dowe DL, Hernández-Orallo J (2012) IQ tests are not for machines, yet. Intelligence 40(2):77–81


  • Dowe DL, Hernández-Orallo J (2014) How universal can an intelligence test be? Adapt Behav 22(1):51–69


  • Drummond C (2009) Replicability is not reproducibility: nor is it good science. In: Proceedings of the evaluation methods for machine learning workshop at the 26th ICML, Montreal, Canada

  • Drummond C, Japkowicz N (2010) Warning: statistical benchmarking is addictive. Kicking the habit in machine learning. J Exp Theor Artif Intell 22(1):67–80


  • Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778

  • Eden AH, Moor JH, Soraker JH, Steinhart E (2013) Singularity hypotheses: a scientific and philosophical assessment. Springer, New York


  • Edmondson W (2012) The intelligence in ETI—what can we know? Acta Astronaut 78:37–42


  • Elo AE (1978) The rating of chessplayers, past and present, vol 3. Batsford, London


  • Embretson SE, Reise SP (2000) Item response theory for psychologists. L. Erlbaum, Hillsdale


  • Evans JM, Messina ER (2001) Performance metrics for intelligent systems. NIST Special Publication SP, pp 101–104

  • Everitt T, Lattimore T, Hutter M (2014) Free lunch for optimisation under the universal distribution. In: 2014 IEEE Congress on evolutionary computation (CEC), IEEE, pp 167–174

  • Falkenauer E (1998) On method overfitting. J Heuristics 4(3):281–287


  • Feldman J (2003) Simplicity and complexity in human concept learning. Gen Psychol 38(1):9–15


  • Ferrando PJ (2009) Difficulty, discrimination, and information indices in the linear factor analysis model for continuous item responses. Appl Psychol Meas 33(1):9–24


  • Ferrando PJ (2012) Assessing the discriminating power of item and test scores in the linear factor-analysis model. Psicológica 33:111–139


  • Ferri C, Hernández-Orallo J, Modroiu R (2009) An experimental comparison of performance measures for classification. Pattern Recogn Lett 30(1):27–38


  • Ferrucci D, Brown E, Chu-Carroll J, Fan J, Gondek D, Kalyanpur AA, Lally A, Murdock J, Nyberg E, Prager J et al (2010) Building Watson: an overview of the DeepQA project. AI Mag 31(3):59–79


  • Fogel DB (1991) The evolution of intelligent decision making in gaming. Cybern Syst 22(2):223–236


  • Gaschnig J, Klahr P, Pople H, Shortliffe E, Terry A (1983) Evaluation of expert systems: issues and case studies. Build Exp Syst 1:241–278


  • Geissman JR, Schultz RD (1988) Verification & validation. AI Exp 3(2):26–33


  • Genesereth M, Love N, Pell B (2005) General game playing: overview of the AAAI competition. AI Mag 26(2):62


  • Gerónimo D, López AM (2014) Datasets and benchmarking. In: Vision-based pedestrian protection systems for intelligent vehicles. Springer, pp 87–93

  • Goertzel B, Pennachin C (eds) (2007) Artificial general intelligence. Springer, New York


  • Goertzel B, Arel I, Scheutz M (2009) Toward a roadmap for human-level artificial general intelligence: embedding HLAI systems in broad, approachable, physical or virtual contexts. Artif Gen Intell Roadmap Initiat

  • Goldreich O, Vadhan S (2007) Special issue on worst-case versus average-case complexity editors’ foreword. Comput complex 16(4):325–330


  • Gordon BB (2007) Report on panel discussion on (re-)establishing or increasing collaborative links between artificial intelligence and intelligent systems. In: Messina ER, Madhavan R (eds) Proceedings of the 2007 workshop on performance metrics for intelligent systems, pp 302–303

  • Gulwani S, Hernández-Orallo J, Kitzelmann E, Muggleton SH, Schmid U, Zorn B (2015) Inductive programming meets the real world. Commun ACM 58(11):90–99


  • Hand DJ (2004) Measurement theory and practice. A Hodder Arnold Publication, London


  • Hernández-Orallo J (2000a) Beyond the Turing test. J Logic Lang Inf 9(4):447–466


  • Hernández-Orallo J (2000b) On the computational measurement of intelligence factors. In: Meystel A (ed) Performance metrics for intelligent systems workshop. National Institute of Standards and Technology, Gaithersburg, pp 1–8


  • Hernández-Orallo J (2000c) Thesis: computational measures of information gain and reinforcement in inference processes. AI Commun 13(1):49–50


  • Hernández-Orallo J (2010) A (hopefully) non-biased universal environment class for measuring intelligence of biological and artificial systems. In: Artificial general intelligence, 3rd international conference, Atlantis Press, pp 182–183. Extended report at http://users.dsic.upv.es/proy/anynt/unbiased.pdf

  • Hernández-Orallo J (2014) On environment difficulty and discriminating power. Auton Agents Multi-Agent Syst 29(3):402–454. doi:10.1007/s10458-014-9257-1

  • Hernández-Orallo J, Dowe DL (2010) Measuring universal intelligence: towards an anytime intelligence test. Artif Intell 174(18):1508–1539


  • Hernández-Orallo J, Dowe DL (2013) On potential cognitive abilities in the machine kingdom. Minds Mach 23:179–210


  • Hernández-Orallo J, Minaya-Collado N (1998) A formal definition of intelligence based on an intensional variant of Kolmogorov complexity. In: Proceedings of international symposium of engineering of intelligent systems (EIS’98), ICSC Press, pp 146–163

  • Hernández-Orallo J, Dowe DL, España-Cubillo S, Hernández-Lloreda MV, Insa-Cabrera J (2011) On more realistic environment distributions for defining, evaluating and developing intelligence. In: Schmidhuber J, Thórisson K, Looks M (eds) Artificial general intelligence, LNAI, vol 6830. Springer, New York, pp 82–91


  • Hernández-Orallo J, Flach P, Ferri C (2012a) A unified view of performance metrics: translating threshold choice into expected classification loss. J Mach Learn Res 13(1):2813–2869


  • Hernández-Orallo J, Insa-Cabrera J, Dowe DL, Hibbard B (2012b) Turing Tests with Turing machines. In: Voronkov A (ed) Turing-100, EPiC Series, vol 10, pp 140–156

  • Hernández-Orallo J, Dowe DL, Hernández-Lloreda MV (2014) Universal psychometrics: measuring cognitive abilities in the machine kingdom. Cogn Syst Res 27:50–74


  • Hernández-Orallo J, Martínez-Plumed F, Schmid U, Siebers M, Dowe DL (2016) Computer models solving intelligence test problems: progress and implications. Artif Intell 230:74–107


  • Herrmann E, Call J, Hernández-Lloreda MV, Hare B, Tomasello M (2007) Humans have evolved specialized skills of social cognition: the cultural intelligence hypothesis. Science 317(5843):1360–1366


  • Hibbard B (2009) Bias and no free lunch in formal measures of intelligence. J Artif Gen Intell 1(1):54–61


  • Hingston P (2010) A new design for a Turing Test for bots. In: 2010 IEEE symposium on computational intelligence and games (CIG), IEEE, pp 345–350

  • Hingston P (2012) Believable bots: can computers play like people?. Springer, New York


  • Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300


  • Hutter M (2007) Universal algorithmic intelligence: a mathematical top→down approach. In: Goertzel B, Pennachin C (eds) Artificial general intelligence, cognitive technologies. Springer, Berlin, pp 227–290


  • Igel C, Toussaint M (2005) A no-free-lunch theorem for non-uniform distributions of target functions. J Math Model Algorithms 3(4):313–322


  • Insa-Cabrera J (2016) Towards a universal test of social intelligence. Ph.D. thesis, Departament de Sistemes Informàtics i Computació, UPV

  • Insa-Cabrera J, Dowe DL, España-Cubillo S, Hernández-Lloreda MV, Hernández-Orallo J (2011a) Comparing humans and ai agents. In: Schmidhuber J, Thórisson K, Looks M (eds) Artificial general intelligence, LNAI, vol 6830. Springer, New York, pp 122–132


  • Insa-Cabrera J, Dowe DL, Hernández-Orallo J (2011) Evaluating a reinforcement learning algorithm with a general intelligence test. In: Lozano JA, Gamez JM (eds) Current topics in artificial intelligence. CAEPIA 2011, LNAI series 7023. Springer, New York


  • Insa-Cabrera J, Benacloch-Ayuso JL, Hernández-Orallo J (2012) On measuring social intelligence: experiments on competition and cooperation. In: Bach J, Goertzel B, Iklé M (eds) AGI, lecture notes in computer science, vol 7716. Springer, New York, pp 126–135


  • Jacoff A, Messina E, Weiss BA, Tadokoro S, Nakagawa Y (2003) Test arenas and performance metrics for urban search and rescue robots. In: Proceedings of 2003 IEEE/RSJ international conference on intelligent robots and systems, 2003 (IROS 2003), IEEE, vol 4, pp 3396–3403

  • Japkowicz N, Shah M (2011) Evaluating learning algorithms. Cambridge University Press, Cambridge


  • Jiang J (2008) A literature survey on domain adaptation of statistical classifiers. http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey

  • Johnson M, Hofmann K, Hutton T, Bignell D (2016) The Malmo platform for artificial intelligence experimentation. In: International joint conference on artificial intelligence (IJCAI)

  • Keith TZ, Reynolds MR (2010) Cattell–Horn–Carroll abilities and cognitive tests: what we’ve learned from 20 years of research. Psychol Schools 47(7):635–650


  • Ketter W, Symeonidis A (2012) Competitive benchmarking: lessons learned from the trading agent competition. AI Mag 33(2):103


  • Khreich W, Granger E, Miri A, Sabourin R (2012) A survey of techniques for incremental learning of HMM parameters. Inf Sci 197:105–130


  • Kim JH (2004) Soccer robotics, vol 11. Springer, New York


  • Kitano H, Asada M, Kuniyoshi Y, Noda I, Osawa E (1997) Robocup: the robot world cup initiative. In: Proceedings of the first international conference on autonomous agents, ACM, pp 340–347

  • Kleiner K (2011) Who are you calling bird-brained? An attempt is being made to devise a universal intelligence test. Economist 398(8723, 5 March 2011):82


  • Knuth DE (1973) Sorting and searching, volume 3 of the art of computer programming. Addison-Wesley, Reading


  • Koza JR (2010) Human-competitive results produced by genetic programming. Genet Program Evolvable Mach 11(3–4):251–284


  • Krueger J, Osherson D (1980) On the psychology of structural simplicity. In: Jusczyk PW, Klein RM (eds) The nature of thought: essays in honor of D. O. Hebb. Psychology Press, London, pp 187–205


  • Langford J (2005) Clever methods of overfitting. Machine Learning (Theory). http://hunch.net

  • Langley P (1987) Research papers in machine learning. Mach Learn 2(3):195–198


  • Langley P (2011) The changing science of machine learning. Mach Learn 82(3):275–279


  • Langley P (2012) The cognitive systems paradigm. Adv Cogn Syst 1:3–13


  • Lattimore T, Hutter M (2013) No free lunch versus Occam’s razor in supervised learning. In: Dowe DL (ed) Algorithmic probability and friends. Bayesian prediction and artificial intelligence. Springer, pp 223–235

  • Leeuwenberg ELJ, Van Der Helm PA (2012) Structural information theory: the simplicity of visual form. Cambridge University Press, Cambridge


  • Legg S, Hutter M (2007a) Tests of machine intelligence. In: Lungarella M, Iida F, Bongard J, Pfeifer R (eds) 50 Years of Artificial Intelligence, Lecture Notes in Computer Science, vol 4850, Springer Berlin Heidelberg, pp 232–242. doi:10.1007/978-3-540-77296-5_22

  • Legg S, Hutter M (2007b) Universal intelligence: a definition of machine intelligence. Minds Mach 17(4):391–444


  • Legg S, Veness J (2013) An approximation of the universal intelligence measure. In: Dowe DL (ed) Algorithmic probability and friends. Bayesian prediction and artificial intelligence. Springer, pp 236–249

  • Levesque HJ (2014) On our best behaviour. Artif Intell 212:27–35


  • Levesque HJ, Davis E, Morgenstern L (2012) The Winograd schema challenge. In: Proceedings of the thirteenth international conference on the principles of knowledge representation and reasoning, pp 552–561

  • Levin LA (1973) Universal sequential search problems. Prob Inf Transm 9(3):265–266


  • Levin LA (1986) Average case complete problems. SIAM J Comput 15:285–286


  • Levin LA (2013) Universal heuristics: how do humans solve unsolvable problems? In: Dowe DL (ed) Algorithmic probability and friends. Bayesian prediction and artificial intelligence, lecture notes in computer science, vol 7070. Springer, New York, pp 53–54


  • Li M, Vitányi P (2008) An introduction to Kolmogorov complexity and its applications, 3rd edn. Springer, New York


  • Livingstone D (2006) Turing’s test and believable AI in games. Comput Entertain CIE 4(1):6


  • Llargues-Asensio JM, Peralta J, Arrabales R, González-Bedía M, Cortez P, López-Peña AL (2014) Artificial intelligence approaches for the generation and assessment of believable human-like behaviour in virtual characters. Expert Syst Appl

  • Long D, Fox M (2003) The 3rd international planning competition: results and analysis. J Artif Intell Res JAIR 20:1–59


  • Lord FM (1980) Applications of item response theory to practical testing problems. Erlbaum, Mahwah


  • Macià N, Bernadó-Mansilla E (2014) Towards UCI+: a mindful repository design. Inf Sci 261:237–262


  • Madhavan R, Tunstel E, Messina E (2009) Performance evaluation and benchmarking of intelligent systems. Springer, New York


  • Mahoney MV (1999) Text compression as a test for artificial intelligence. In: Proceedings of the national conference on artificial intelligence, AAAI, p 970

  • Marché C, Zantema H (2007) The termination competition. In: Term rewriting and applications, Springer, pp 303–313

  • Marcus G, Rossi F, Veloso M (2016) Beyond the Turing test (special issue). AI Mag 37(1):3–101


  • Masum H, Christensen S (2003) The Turing ratio: a framework for open-ended task metrics. J Evol Technol

  • Masum H, Christensen S, Oppacher F (2002) The Turing ratio: metrics for open-ended tasks. In: GECCO, Citeseer, pp 973–980

  • McCarthy J (2007) What is artificial intelligence? Technical report, Stanford University. http://www-formal.stanford.edu/jmc/whatisai.html

  • McCorduck P (2004) Machines who think. A K Peters/CRC Press, Boca Raton


  • McDermott J, White DR, Luke S, Manzoni L, Castelli M, Vanneschi L, Jaśkowski W, Krawiec K, Harper R, Jong KD, O’Reilly UM (2012) Genetic programming needs better benchmarks. In: Proceedings of the 14th international conference on Genetic and evolutionary computation conference. ACM, Philadelphia, pp 791–798

  • McGuigan M (2006) Graphics Turing Test. arXiv preprint arXiv:cs/0603132

  • Melkikh AV (2014) The no free lunch theorem and hypothesis of instinctive animal behavior. Artif Intell Res 3(4):p43


  • Mellenbergh GJ (1994) Generalized linear item response theory. Psychol Bull 115(2):300


  • Mesnil G, Dauphin Y, Glorot X, Rifai S, Bengio Y, Goodfellow IJ, Lavoie E, Muller X, Desjardins G, Warde-Farley D, et al (2012) Unsupervised and transfer learning challenge: a deep learning approach. JMLR: Workshop and Conference Proceedings, 2012 ICML Workshop on Unsupervised and Transfer Learning vol 27, pp 97–110

  • Messina E, Meystel A, Reeker L (2001) PerMIS 2001, white paper. In: Meystel AM, Messina ER (eds) Measuring the performance and intelligence of systems: proceedings of the 2001 PerMIS Workshop, September 4, 2001, National Institute of Standards and Technology (NIST) Special Publication 982. Gaithersburg, pp 3–15

  • Meystel A (2000) PerMIS 2000 white paper: measuring performance and intelligence of systems with autonomy. In: Meystel AM, Messina ER (eds) Measuring the performance and intelligence of systems: proceedings of the 2000 PerMIS Workshop, August 14–16, 2000, National Institute of Standards and Technology (NIST) Special Publication 970. Gaithersburg, pp 1–34

  • Meystel A, Albus J, Messina E, Leedom D (2003a) Performance measures for intelligent systems: measures of technology readiness. Technical report, DTIC Document

  • Meystel A, Albus J, Messina E, Leedom D (2003) PerMIS 2003 white paper: performance measures for intelligent systems—measures of technology readiness. In: Meystel AM, Messina ER (eds) Measuring the performance and intelligence of systems: proceedings of the 2003 PerMIS Workshop, National Institute of Standards and Technology (NIST) Special Publication 1014. Gaithersburg

  • Minsky ML (ed) (1968) Semantic information processing. MIT Press, Cambridge


  • Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533


  • Morgenstern L, Davis E, Ortiz-Jr CL (2016) Planning, executing, and evaluating the Winograd schema challenge. AI Mag 37(1):50–54


  • Mueller S, Jones M, Minnery B, Hiland JM (2007) The BICA cognitive decathlon: a test suite for biologically-inspired cognitive agents. In: Proceedings of behavior representation in modeling and simulation conference, Norfolk

  • Mueller ST (2010) A partial implementation of the BICA cognitive decathlon using the psychology experiment building language (PEBL). Int J Mach Conscious 2(02):273–288


  • Mueller ST, Minnery BS (2008) Adapting the Turing Test for embodied neurocognitive evaluation of biologically-inspired cognitive agents. In: Proceedings of 2008 AAAI fall symposium on biologically inspired cognitive architectures

  • Newell A (1973) You can’t play 20 questions with nature and win: projective comments on the papers of this symposium. In: Chase W (ed) Visual information processing. Academic Press, New York, pp 283–308


  • Newell A (1980) Physical symbol systems. Cogn Sci 4(2):135–183


  • Newell A (1990) Unified theories of cognition. Harvard University, Cambridge


  • Newell A, Simon HA (1976) Computer science as empirical inquiry: symbols and search. Commun ACM 19(3):113–126


  • Nizamani AR (2015) Reasoning with bounded cognitive resources. Ph.D. thesis, Department of Applied Information Technology, Chalmers University of Technology & University of Gothenburg, Sweden

  • Oppy G, Dowe DL (2011) The Turing Test. In: Zalta EN (ed) Stanford Encyclopedia of Philosophy, Stanford University. http://plato.stanford.edu/entries/turing-test/

  • Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359


  • Perez D, Samothrakis S, Togelius J, Schaul T, Lucas S, Couëtoux A, Lee J, Lim CU, Thompson T (2015) The 2014 general video game playing competition. IEEE Trans Comput Intell AI Games

  • Potthast M, Hagen M, Gollub T, Tippmann M, Kiesel J, Rosso P, Stamatatos E, Stein B (2013) Overview of the 5th international competition on plagiarism detection. In: CLEF 2013 evaluation labs and workshop, working notes papers, 23–26 September, Valencia, Spain

  • Proudfoot D (2011) Anthropomorphism and AI: Turing’s much misunderstood imitation game. Artif Intell 175(5):950–957


  • Quinn AJ, Bederson BB (2011) Human computation: a survey and taxonomy of a growing field. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, pp 1403–1412

  • Rajani S (2011) Artificial intelligence—man or machine. Int J Inf Technol 4(1):173–176


  • Rao RB, Fung G, Rosales R (2008) On the dangers of cross-validation. An experimental evaluation. In: SDM, SIAM, pp 588–596

  • Rohrer B (2010) Accelerating progress in artificial general intelligence: choosing a benchmark for natural world interaction. J Artif Gen Intell 2(1):1–28


  • Rothenberg J, Paul J, Kameny I, Kipps JR, Swenson M (1987) Evaluating expert system tools: a framework and methodology-workshops. Technical report, DTIC Document

  • Russell S, Norvig P (2009) Artificial intelligence: a modern approach. Prentice Hall, Upper Saddle River


  • Sanghi P, Dowe DL (2003) A computer program capable of passing IQ tests. In: 4th international conference on cognitive science (ICCS’03), Sydney, pp 570–575

  • Schaeffer J, Burch N, Bjornsson Y, Kishimoto A, Muller M, Lake R, Lu P, Sutphen S (2007) Checkers is solved. Science 317(5844):1518


  • Schaie KW (2010) Primary mental abilities. Corsini Encyclopedia of Psychology

  • Schaul T (2014) An extensible description language for video games. IEEE Trans Comput Intell AI Games PP(99):1–1. doi:10.1109/TCIAIG.2014.2352795

  • Schenck C (2013) Intelligence tests for robots: Solving perceptual reasoning tasks with a humanoid robot. Master’s thesis, Iowa State University

  • Schlenoff C, Scott H, Balakirsky S (2011) Performance evaluation of intelligent systems at the National Institute of Standards and Technology (NIST). Technical report, DTIC Document

  • Schmid U, Ragni M (2015) Comparing computer models solving number series problems. In: Artificial general intelligence. Springer, pp 352–361

  • Schweizer P (1998) The truly total Turing test. Minds Mach 8(2):263–272


  • Searle JR (1980) Minds, brains, and programs. Behav Brain Sci 3:417–457


  • Seber GAF, Salehi MM (2013) Adaptive cluster sampling. In: Adaptive sampling designs. Springer, pp 11–26

  • Settles B (2012) Active learning. Synth Lect Artif Intell Mach Learn 6(1):1–114


  • Shettleworth SJ (2010) Cognition, evolution, and behavior. Oxford University Press, Oxford


  • Shettleworth SJ, Bloom P, Nadel L (2013) Fundamentals of comparative cognition. Oxford University Press, Oxford


  • Shieber SM (2016) Principles for designing an AI competition, or why the Turing test fails as an inducement prize. AI Mag 37(1):91–96


  • Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489


  • Simmons R (2000) Survivability and competence as measures of intelligent systems. In: Meystel AM, Messina ER (eds) Measuring the performance and intelligence of systems: proceedings of the 2000 PerMIS Workshop, August 14–16, 2000, National Institute of Standards and Technology (NIST) Special Publication 970. Gaithersburg, pp 162–163

  • Simon HA (1995) Artificial intelligence: an empirical science. Artif Intell 77(1):95–127


  • Sloman A, Scheutz M (2002) A framework for comparing agent architectures. Proceedings of UKCI 2

  • Smith WD (2002) Rating systems for gameplayers, and learning. Technical report, NEC, Princeton, NJ, pp 93–104

  • Smith WD (2006) Mathematical definition of “intelligence” (and consequences). Unpublished report

  • Soares C (2009) UCI++: improved support for algorithm selection using datasetoids. In: Advances in knowledge discovery and data mining. Springer, pp 499–506

  • Solomonoff R (1996) Does algorithmic probability solve the problem of induction. Inf Stat Induction Sci 7–8

  • Solomonoff RJ (1964) A formal theory of inductive inference. Part I. Inf Control 7(1):1–22


  • Solomonoff RJ (1984) Optimum sequential search. Oxbridge Research, Cambridge. http://world.std.com/~rjs/optseq.pdf

  • Srinivasan R (2002) Importance sampling: applications in communications and detection. Springer, New York


  • Starkie B, van Zaanen M, Estival D (2006) The Tenjinno machine translation competition. In: Grammatical inference: algorithms and applications. Springer, pp 214–226

  • Sternberg RJ (ed) (2000) Handbook of intelligence. Cambridge University Press, Cambridge


  • Strannegård C, Amirghasemi M, Ulfsbücker S (2013a) An anthropomorphic method for number sequence problems. Cogn Syst Res 22–23:27–34


  • Strannegård C, Nizamani A, Sjöberg A, Engström F (2013b) Bounded Kolmogorov complexity based on cognitive models. In: Kühnberger KU, Rudolph S, Wang P (eds) Artificial general intelligence. Lecture notes in computer science, vol 7999. Springer, Berlin Heidelberg, pp 130–139


  • Strickler RE (1973) Change in selected characteristics of students between ninth and twelfth grade as related to high school curriculum

  • Sturtevant N (2012) Benchmarks for grid-based pathfinding. IEEE Trans Comput Intell AI Games 4(2):144–148. http://web.cs.du.edu/~sturtevant/papers/benchmarks.pdf

  • Sutcliffe G (2009) The TPTP problem library and associated infrastructure: the FOF and CNF Parts, v3.5.0. J Autom Reason 43(4):337–362


  • Sutcliffe G, Suttner C (2006) The state of CASC. AI Commun 19(1):35–48


  • Thrun S (1996) Is learning the n-th thing any easier than learning the first? In: Advances in neural information processing systems, pp 640–646

  • Thrun S, Pratt L (2012) Learning to learn. Springer, New York


  • Thurstone LL (1938a) Primary mental abilities. Psychometric monographs

  • Thurstone LL (1938b) Primary mental abilities. Psychometric monographs

  • Togelius J, Yannakakis GN, Karakovskiy S, Shaker N (2012) Assessing believability. In: Believable bots, Springer, pp 215–230

  • Torrey L, Shavlik J (2009) Transfer learning. Handb Res Mach Learn Appl 3:17–35


  • Turing AM (1950) Computing machinery and intelligence. Mind 59:433–460


  • Valiant LG (1984) A theory of the learnable. Commun ACM 27(11):1134–1142


  • Vallati M, Chrpa L, Grzes M, McCluskey TL, Roberts M, Sanner S (2015) The 2014 international planning competition: progress and trends. AI Mag 36(3):90–98


  • van Rijn JN, Bischl B, Torgo L, Gao B, Umaashankar V, Fischer S, Winter P, Wiswedel B, Berthold MR, Vanschoren J (2013) OpenML: a collaborative science platform. In: Machine learning and knowledge discovery in databases. Springer, pp 645–649

  • Vanschoren J, Blockeel H, Pfahringer B, Holmes G (2012) Experiment databases. Mach Learn 87(2):127–158


  • Vanschoren J, van Rijn JN, Bischl B, Torgo L (2014) OpenML: networked science in machine learning. ACM SIGKDD Explor Newsl 15(2):49–60


  • Vázquez D, López AM, Marín J, Ponsa D, Gerónimo D (2014) Virtual and real world adaptation for pedestrian detection. IEEE Trans Pattern Anal Mach Intell 36(4):797–809. doi:10.1109/TPAMI.2013.163


  • Vere SA (1992) A cognitive process shell. Behav Brain Sci 15(03):460–461


  • von Ahn L (2009) Human computation. In: 46th ACM/IEEE design automation conference (DAC’09), IEEE, pp 418–419

  • von Ahn L, Blum M, Langford J (2004) Telling humans and computers apart automatically. Commun ACM 47(2):56–60


  • von Ahn L, Maurer B, McMillen C, Abraham D, Blum M (2008) reCAPTCHA: human-based character recognition via web security measures. Science 321(5895):1465


  • Wallace CS, Boulton DM (1968) An information measure for classification. Comput J 11(2):185–194


  • Wallace CS, Dowe DL (1999) Minimum message length and Kolmogorov complexity. Comput J 42(4):270–283 (special issue on Kolmogorov complexity)


  • Wang G, Mohanlal M, Wilson C, Wang X, Metzger M, Zheng H, Zhao BY (2012) Social Turing tests: crowdsourcing sybil detection. arXiv preprint arXiv:1205.3856

  • Wang P (2010) The evaluation of AGI systems. In: Proceedings of the third conference on artificial general intelligence, Citeseer, pp 164–169

  • Warwick K (2014) Turing Test success marks milestone in computing history. University of Reading press release

  • Wasserman EA, Zentall TR (2006) Comparative cognition: Experimental explorations of animal intelligence. Oxford University Press, Oxford


  • Watkins CJCH, Dayan P (1992) Q-learning. Mach Learn 8(3):279–292


  • Weiss DJ (2011) Better data from better measurements using computerized adaptive testing. J Methods Meas Soc Sci 2(1):1–27


  • Weizenbaum J (1966) ELIZA—a computer program for the study of natural language communication between man and machine. Commun ACM 9(1):36–45


  • Wellman M, Reeves D, Lochner K, Vorobeychik Y (2004) Price prediction in a trading agent competition. J Artif Intell Res JAIR 21:19–36


  • White DR, McDermott J, Castelli M, Manzoni L, Goldman BW, Kronberger G, Jaśkowski W, O’Reilly UM, Luke S (2013) Better GP benchmarks: community survey results and proposals. Genet Program Evolvable Mach 14:3–29. doi:10.1007/s10710-012-9177-2


  • Whiteson S, Tanner B, White A (2010) The reinforcement learning competitions. AI Mag 31(2):81–94


  • Whiteson S, Tanner B, Taylor ME, Stone P (2011) Protecting against evaluation overfitting in empirical reinforcement learning. In: 2011 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), IEEE, pp 120–127

  • Williams PL, Beer RD (2010) Information dynamics of evolved agents. In: From animals to animats 11, Springer, pp 38–49

  • Winikoff M, Cranefield S (2014) On the testability of bdi agent systems. J Artif Intell Res JAIR 51:71–131


  • Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390


  • Wolpert DH (2012) What the no free lunch theorems really mean; how to improve search algorithms. Technical report, Santa fe Institute Working Paper

  • Wolpert DH, Macready WG (1995) No free lunch theorems for search. Technical report SFI-TR-95-02-010 (Santa Fe Institute)

  • Wolpert DH, Macready WG (2005) Coevolutionary free lunches. IEEE Trans Evol Comput 9(6):721–735


  • Yampolskiy RV (2015) Artificial superintelligence: a futuristic approach. CRC Press, Boca Raton


  • Yonck R (2012) Toward a standard metric of machine intelligence. World Future Rev 4(2):61–70


  • You J (2015) Beyond the turing test. Science 347(6218):116–116


  • Zatuchna Z, Bagnall A (2009) Learning mazes with aliasing states: an LCS algorithm with associative perception. Adapt Behav 17(1):28–57


  • Zhou ZH (2012) Ensemble methods: foundations and algorithms. CRC Press, Boca Raton



Acknowledgments

I thank the organisers of the AEPIA Summer School On Artificial Intelligence, held in September 2014, for giving me the opportunity to give a lecture on ‘AI Evaluation’. This paper was born out of and evolved through that lecture. The information about many benchmarks and competitions discussed in this paper has been contrasted with information from and discussions with many people: M. Bedia, A. Cangelosi, C. Dimitrakakis, I. García-Varea, Katja Hofmann, W. Langdon, E. Messina, S. Mueller, M. Siebers and C. Soares. Figure 4 is courtesy of F. Martínez-Plumed. Finally, I thank the anonymous reviewers, whose comments have helped to significantly improve the balance and coverage of the paper. This work has been partially supported by the EU (FEDER) and the Spanish MINECO under Grants TIN 2013-45732-C4-1-P, TIN 2015-69175-C4-1-R and by Generalitat Valenciana PROMETEOII2015/013.

Author information

Corresponding author

Correspondence to José Hernández-Orallo.


About this article


Cite this article

Hernández-Orallo, J. Evaluation in artificial intelligence: from task-oriented to ability-oriented measurement. Artif Intell Rev 48, 397–447 (2017). https://doi.org/10.1007/s10462-016-9505-7

