Introduction
Building machines that learn and think like humans is one of the core ambitions of Artificial Intelligence (AI) and Machine Learning (ML) in particular. On the quest for this goal, artificial learners have made groundbreaking accomplishments in many domains spanning object recognition, image processing, speech recognition, medical information processing, robotics and control, bioinformatics, natural language processing (NLP), cybersecurity, and many others. Their success has captured attention beyond academia. In industry, many companies such as Google and Facebook devoted active research instances to explore these technologies.
Ultimately, AI has succeeded to speed up its pace to be like humans and even defeats humans in some fields. AlphaGo [
1] defeats human champions in the ancient game of Go. The deep network ResNet [
2] obtains better classification performance than humans on ImageNet. And, recently, Google launched Meena [
3] a human-like AI chatbot that can conduct sensible and specific conversations like humans. However, there is another side to this coin, the impressive results achieved with modern ML (in particular by deep learning) are made possible largely by the use of huge datasets. For instance, DeepMinds’s AlphaGo used more than 38 million positions to train their algorithm to play Go. The ImageNet database used by ResNet contains about 1.2 million labeled examples. And Meena has been trained on a massive 341 GB corpus, the equivalent of roughly 341,000 books, far more than most people read in a lifetime. This is obviously far from human-like learning. One thing that makes human learners so efficient is that we are active, strategic information-seekers. We learn continually and we make use of our previous experiences. So far, it is not clear how to replicate such abilities into artificial learners. In the big data era, algorithms continue to be more data-hungry, while real facts indicate that many application domains can often only use a few data points because acquiring them involves a process that is expensive or time-consuming.
Consequently, many researchers and engineers began to recognize that the progress of ML is highly dependent on the premise of the availability of large number of input samples (generally with annotations). Without massive data, ML success is uncertain. Hence, key ML researchers have sounded a cautionary note regarding the data hunger behavior of algorithms. In his controversial work “Deep Learning: A Critical Appraisal” [
4], Marcus listed ten concerns about deep learning research, at the top of the list data hungriness. he noted that “in problems where data are limited, deep learning often is not an ideal solution”. Data-hungriness was also included in the unsolved problems in AI research, described in the book “Architects of Intelligence” by Martin Ford [
5]. Most of the experts interviewed in this book are calling for more data-efficient algorithms. For instance, Oren Etzioni quoted in the book that “stepping stone [towards AGI] is that it’s very important that [AI] systems be a lot more data-efficient. So, how many examples do you need to learn from?” [
5, p. 502].
As such, our work extends the recent call for more research on data-efficient algorithms. In fact, we view these concerns as an opportunity to examine in-depth what it means for a machine to learn efficiently like humans, what are the efforts deployed to alleviate data- hungriness, and what are the possible research avenues to explore. Studying data hungriness of ML algorithms is unfortunately a topic that has not yet received sufficient attention in the academic research community, nonetheless, it is of big importance and impact. Accordingly, the main aim of this survey is to stimulate research on this topic by providing interested researchers with a clear picture of the current research landscape.
Prior to this paper, we know of few works that attempted to investigate the issue of data-hungriness. Shu et al. [
6] proposed a survey that covers learning methods for small samples regime, they focused on concept learning and experience learning. Wang et al. [
7] surveyed few-shot learning by putting on light methods operating at the level of data, model, and algorithms. Qi et al. [
8] discussed small data challenges from unsupervised and semi-supervised learning perspectives, they presented an up-to-date review of the progress in these two paradigms. As a matter of fact, existing surveys are limited in the way they approach the problem and in the scope they cover. In contrast, our work seeks comprehensiveness, we tackle the issue from an interdisciplinary perspective and discussed potential solutions from different backgrounds and horizons. In doing so, we brought different concepts under one roof that were never discussed together and tried to draw connections between them. Furthermore, as all AI players are concerned by the issue, while elaborating the survey we deliberately tried to make it accessible to the non-theoretician while still providing precise arguments for the specialist. In this respect, we make three main contributions:
-
We propose a comprehensive background regarding the causes, manifestations, and implications of data-hungry algorithms. For a good understanding of the issue, particular attention is given to the nature and the evolution of the data/algorithm relationship.
-
Based on an analysis of the literature, we provide an organized overview of the proposed techniques capable of alleviating data hunger. Through this overview, readers will understand how to expand limited datasets to take advantage of the capabilities of big data.
-
In our discussions, we identify many directions to accompany previous and potential future research. We hope it inspires more researchers to engage in the various topics discussed in the paper.
Accordingly, the remainder of the survey is organized as follows. “
Background” section presents a preliminary background. “
Review” section surveys existing solutions and organizes surveyed approaches according to four research strategies. “
Discussion” section discusses research directions and open problems that we gathered and distilled from the literature review. Finally, “
Conclusion” section concludes this survey.
Discussion
All in all, there is no general standard solution regarding how to cure data hungriness, many perceptions exist but none of them can be asserted to be an absolute solution. Beyond research laboratories, results produced in real-world conditions indicate that existing techniques are yet to be industrialized. And more importantly, with the absence of rational metrics to evaluate and compare techniques, we cannot objectively justify the choice of a technique over another. That being said, we believe that research on this issue is just in its infancy. Without a doubt, considering the facts from industrial and academic worlds, it is a golden time for data-efficient algorithms to rise. However, considering what has been done in the literature so far, improvements are expected from the community working on the issue in order to advance research in this area. In this section, we discuss some research directions and open challenges distilled from the surveyed works, we propose to group them in four themes, namely: (i) Hybridization, (ii) Evaluation, (iii) Automation, and (iv) Humanization.
-
Hybridization. The last strategy discussed in the review advocates the use of hybrid systems in order to benefit from the strength of each component and achieve more powerful systems. This perception is an interesting avenue for future research, in the sense that further value-added combinations can be investigated.
In the literature, we have seen how some techniques from the same strategy can be used in complementary to each other, like generative augmentations and basic transformations in DA, and meta-learning and TL in knowledgeable systems. However, works that study this kind of composition are still limited in both variety and depth. Furthermore, hybridization of techniques from different strategies is restrictively steered, in some way, towards almost one direction; combining DA with TL for DNN as an effective method for reducing overfitting, improving model performance, and quickly learning new tasks with limited dataset. Much research has been devoted into this vein [
287‐
289], the aim is to develop practical software tools for systematical integration of DA and TL into deep learning workflows and helping engineers utilize the performance power of these techniques much faster and more easily. It is indeed the best we can hope to empower DNN and mitigate its limitations. However, we believe it is also healthy to explore the potential of other innovative combinations similar to neural-symbolic systems, that not only integrate the reviewed techniques but also call upon other domains such as evolutionary approaches, statistical models, and cognitive reasoning. In this sense, multi-disciplinary studies like this paper are needed to build links between backgrounds and domains that are studied separately and to bring closer their bodies of research that are moving in different directions. Here, we intuitively and seamlessly bridged between the different strategies by considering, for instance, that FSL problem can be viewed as a semi-supervised learning problem with few available labeled data. Its aim is to transfer the knowledge of learning (e.g., meta-learning) from the source tasks to the target ones. Domain adaptation, which is a particular way of transfer learning is also a useful technique of data augmentation. We believe that making connections and enabling hybridization is a rich, under-explored area for future research that could help to converge to one unified solution. A general, adaptable, data-resistant system that will perform well in domains where ample data is available but also in data-scarce domains. It’s far from obvious how to combine all the pieces and to explore others to conceive such custom systems that work on both settings, but researchers have to shift their attention towards this goal in order to fill the gap in thinking of how to build robust AI.
Semi-supervised and unsupervised methods are often evaluated based on their performances on downstream tasks by using datasets such as CIFAR-10, ImageNet, Places, and Pascal VOC. CIFAR-10 and SVHN are popular choices for evaluating the performances of semi-supervised models by training them with all unlabeled data and various amount of labeled examples. To provide a realistic evaluation, it is important to establish more high-quality baselines to allow for proper assessment of the added value of the unlabeled data. Researchers should thus evaluate their algorithms on a diverse suite of data sets with different quantities of unlabeled data and report how performance varies with the amount of unlabeled data. Oliver et al. [
119] compared several SSNN on two image classification problems. They reported substantial performance improvements for most of the algorithms, and observed that the error rates declined as more unlabeled data points were added. These results are interesting in the sense that they indicate that, in image classification tasks, unlabeled data used by ANN can drive consistent improvement in performance. Likewise, it would be interesting to explore more empirical evaluations to draw more promising results that will guide research for better unlabeled data-based learners.
As for DA and knowledgeable systems, more theory and formalisms are needed in order to accurately compare and fairly evaluate techniques from these strategies. Indeed, despite the rapid progress of practical DA techniques, precisely understanding their benefits remains ambiguous. There is no common theoretical understanding regarding how training on augmented data affects the learning process, the parameters, and the overall performance. This is exacerbated by the fact that DA is performed in diverse ways in modern ML pipelines, for different tasks and domains, thus precluding a general theoretical framework. Hence, more theoretical insights are expected to theoretically characterize and understand the effect of various data augmentations used in practice in order to be able to evaluate their benefits. On the other hand, knowledgeable systems research community still does not have a good understanding of what the knowledge is in general, how to represent knowledge, and how to use knowledge in learning effectively. A unified theory of knowledge and the related issues is urgently needed in order to compare between knowledgeable systems and to measure how they optimize data requirement.
Certainly, enriching the evaluation baselines of each strategy is an important research avenue to pursue. However, the ultimate goal would be to develop approaches to evaluate in an abstract level, that is to be able to evaluate an altered data-hungry system by measuring how the alteration techniques, abstracting from their nature, have optimized the need for data, and by verifying the performance resistance against the change in the availability of data.
Automation. A common research question discussed in the reviewed strategies is automated design. Automatic generation of a DA schema for a given dataset or automatic learning of a transfer algorithm for a given domain or tasks, are examples of the projection of the general concept of Automated Machine Learning (AutoML) [
290]. AutoML has recently emerged as a novel idea of automating the entire pipeline of learners’ design by using ML to generate better ML. AutoML is advertised as a mean to democratize ML by allowing firms with limited data science expertise to easily build production-ready models in an automatic way, which will accelerate processes, reduce errors and costs, and provide more accurate results, as it enables businesses to select the best-performing algorithm. Practically, AutoML automates some or all steps of a standard ML pipeline that includes data preparation, feature engineering, model generation, and model evaluation [
290]. Hence, one of the missions of autoML is to automatically manage data quality and quantity in the first step of the pipeline. Currently, autoML services rely only on data searching [
291] and data simulator [
292] to deal with effective data acquiring, we expect however that advances in autoML will deeply revolutionize the way we deal with data needs in ML pipeline.
Furthermore, it is worth to highlight the strong interaction between autoML and the reviewed techniques. As discussed before autoML as a general concept can also be instantiated for DA [
172‐
175] and TL [
197,
198] solutions that can also be packaged in an end-to-end automatic process. On the other way around, DA, TL, and other techniques are very useful for autoML tools. In the data preparation step, DA can be regarded as a tool for data collection and as a regularizer to avoid overfitting. In model generation step, as auoML become most popular for the design of deep learning architectures, neural architecture search (NAS) techniques [
293] which target at searching for good deep network architectures that suit the learning problem are mostly used in this step. However, this method has a high computational cost, to address this, TL can use knowledge from prior tasks to speed up network design. In this vein, Wong et al. [
294] proposed an approach that reduces the computational cost of Neural AutoML by using transfer learning. They showed a large reduction in convergence time across many datasets. Existing AutoML algorithms focus only on solving a specific task on some fixed datasets. However, a targeted high-quality AutoML system should have the capability of lifelong learning. Pasunuru et al. [
295], introduced a continual architecture search (CAS) approach enabling lifelong learning. In addition, as the core idea of auoML is to learn to learn, it is natural to find a growing body of research that combines meta-learning and autoML, particularly for NAS improvement [
296,
297]. AutoML has also been studied in few-shot learning scenarios, for instance, Elsken et al. [
297] applied NAS to few-shot learning to overcome the data scarcity, while they only search for the most promising architecture and optimize it to work on multiple few-shot learning tasks. Recently, the idea of unsupervised autoML has begun to be explored, Liu et al. [
298] proposed a general problem setup, namely unsupervised neural architecture search (UnNAS), to explore whether labels are necessary for NAS. They experimentally demonstrated that the architectures searched without labels are competitive with those searched with labels.
-
Humanization. At the root of every intelligent system is the dream of building machines that learn and think like people. Naturally, all attempts to cure data hunger behavior of ML models stem from mimicking the mechanism of human. Currently, humans still retain a clear advantage in terms of sample efficiency of learning. Hence, an obvious research path to keep pursuing is to explore more human-inspired theories and human-like techniques.
We contend that the quest for non-data hungry learning may profit from the rich heritage of problem descriptions, theories, and experimental tools developed by cognitive psychologists. Cognitive psychologists promote a picture of learning that highlights the importance of early inductive biases, including core concepts such as number, space, and objects, as well as powerful learning algorithms that rely on prior knowledge to extract knowledge from small amounts of training data. Studies and insights drawn from cognitive and psychology can then potentially help to examine and understand mechanisms underlying human learning strengths. After all, FSL has been modeled after children’s remarkable cognitive processes to generalize a new concept from a small number of examples. According to developmental psychologists, humans fast learning is hugely reliant on cognitive biases, Shinohara et al. [
299] suggested symmetric bias and mutually exclusive bias as the two most promising cognitive biases that can be effectively employed in ML tasks. Following this line of thought, many advances might come from exploring other cognitive abilities, an interesting avenue might be the study of the commonsense knowledge, how it develops, how it is represented, how it is cumulated, and how it is used in learning. A related study would be to explore intuitive learning theories of physical and social domains. Children at early age have primitive knowledge of physics and social rules, whether learned or innate, it is an intriguing area of research to investigate the prospects for embedding or acquiring this kind of intuitive knowledge in machines, and to study how this could help capturing more human-like learning-to-learn dynamics that enable much stronger transfer to new tasks and new problems, and thus, accelerate the learning of new tasks from very limited amounts of experience and data.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.