Abstract
The escalating climate crisis demands urgent action to mitigate the environmental impact of energy-intensive technologies, including Artificial Intelligence (AI). Lowering AI’s environmental impact requires adopting energy-efficient approaches for training Deep Neural Networks (DNNs). One such approach is to use Dataset Pruning (DP) methods to reduce the number of training instances, and thus the total energy consumed. Numerous DP methods have been proposed in the literature (e.g., GraNd and Craig), with the ultimate aim of speeding up model training. On the other hand, Active Learning (AL) approaches, originally conceived to repeatedly select the best data to be labeled by a human expert (from a large collection of unlabeled data), can be exploited as well to train a model on a relatively small subset of (informative) examples. However, despite allowing for reducing the total amount of training data, most DP methods and pure AL-based schemes entail costly computations that may strongly limit their energy saving potential. In this work, we empirically study the effectiveness of DP and AL methods in curbing energy consumption in DNN training, and propose a novel approach to DNN learning, named Play it straight, which efficiently combines data selection methods and AL-like incremental training. Play it straight is shown to outperform traditional DP and AL approaches, achieving a better trade-off between accuracy and energy efficiency.
1 Introduction
Recent advancements in Artificial Intelligence (AI) have revolutionized numerous industries, providing cutting-edge solutions to complex challenges. AI’s influence extends across healthcare, finance, manufacturing, and more, fundamentally changing our personal and work lives. However, this rapid expansion poses serious concerns related to the escalation of energy usage and associated carbon emissions [5]. A major factor in AI’s energy-intensive nature is the training of data-driven AI models with Deep Learning methods, since massive data volumes and compute are needed to build effective Deep Neural Network (DNN) models, thus incurring a considerable surge in energy usage [28]. In particular, AI-related carbon emissions mainly originate from the electricity consumed during model training. Indeed, since electricity generation still heavily depends on non-renewable sources like coal and natural gas (which remain the cornerstone of global energy production [6]), AI model training plays a role in exacerbating climate change.
As a response to economic and environmental sustainability issues, the emerging research field of Green-AI is committed to lessening the energy and carbon footprint of AI systems and promoting the creation of energy-conscious deep learning models and algorithms. Key areas of Green-AI research include: (i) Minimizing energy usage: creating AI models and algorithms that demand less energy during training and operation; (ii) Harnessing renewable energy sources: utilizing clean energy options like solar and wind power to drive AI processes; (iii) Optimizing hardware: engineering AI-specific hardware designed for superior energy efficiency.
This research work specifically focuses on the problem of efficiently combining data selection and deep learning methods to curb the energy-consumption impact of deep AI models’ learning while ensuring a satisfactory model accuracy.
Existing Solutions. Several approaches have been proposed to address this issue. In particular, Data Pruning (DP) methods [8] allow for extracting a compact sample (a.k.a. coreset) of a given large dataset, which is meant to retain information relevant to some target data analyses. Usually, such a sample is meant to serve as a smaller and cheaper substitute for the original dataset in performing some costly machine learning task [22]. A wide variety of DP solutions have been proposed in the literature, which feature different strengths and weaknesses, and can be grouped into the following main categories: geometry-based methods [26], loss/error based methods [19], gradient matching methods [15], bilevel-optimization methods [31], and sub-modular methods [10]. Unfortunately, most DP methods entail heavy computations, which may cancel out the benefit of shrinking the training set in some application settings. In fact, recent studies [2, 8, 16] revealed that random sampling schemes are a strong data-reduction baseline, often achieving performance similar or superior to that of DP methods. Starting from this observation, the Repeated Sampling of Random Subsets (RS2) [16] method was recently proposed, which attempts to cut training costs by randomly selecting a subset of data for each training epoch. However, a common drawback of the above-mentioned data pruning/sampling methods resides in the fact that the user must guess the amount of data needed beforehand; otherwise, the process needs to be repeated, thus wasting time and energy.
In principle, as proposed in [17, 24], Active Learning (AL) approaches (originally aimed at saving labeling costs by repeatedly picking a few informative instances from a large unlabeled dataset) can be exploited to cut the cost of training a DNN model over large volumes of labeled data, thanks to their ability to focus on a subset of informative examples. However, using a standard AL scheme, involving repeated full model retraining steps over growing subsets of examples as done in [17], may be too energy demanding, as shown empirically in our experimentation.
Contribution. In the light of the limitations of extant data pruning/sampling approaches to efficient DNN learning, in this paper we introduce Play it straight, an algorithm that synergistically combines an RS2-based DNN warm-up step with an iterative AL-like scheme to efficiently refine the DNN with selections of informative data instances.
Fig. 1. Overview of the proposed Play it straight method. The symbol \(\mathcal{M}\) here denotes the DNN model being trained.
In general, AL methods rely on repeatedly choosing a subset of unlabeled data to retrain a model, till a desired accuracy level is reached or a predefined (labeling) budget is consumed. A similar iterative model refinement scheme is here extended to address the problem of efficiently training a DNN model against a large collection of labeled data, starting from a preliminary version of the model, obtained with the help of algorithm RS2 [16].
In more detail, as pictorially sketched in Fig. 1, the two-phase training approach of Play it straight begins with a “boot” phase where a given (randomly initialized) DNN model \(\mathcal{M}\) is partially trained over the entire dataset by running the RS2 procedure with a low value for its reduction factor (so that a small number of model optimization steps are performed); as confirmed by our experimental analysis, this boot phase is expected to efficiently produce an informed initial setting of the DNN that allows for assigning reliable enough importance scores to the data instances.
The second, “fine-tune”, phase then consists in repeatedly selecting small-sized instance subsets, based on their associated importance scores, and exploiting them to incrementally fine-tune the DNN model. This fine-tune loop ends as soon as either the target accuracy, computed on the test set, is reached or the maximum energy budget stated by the user is exceeded. In this AL-like training process, a key design choice concerns the data selection strategy to be used at each fine-tune round: since the selection procedure must be applied to a potentially large amount of data, it must introduce little computational overhead. Thus, we propose to perform each data selection step by combining uncertainty-based scores (namely, least-confidence, entropy or margin based) with error-based scores, both of which are quite cheap to compute.
Experiments performed on two benchmark datasets have shown that Play it straight can substantially reduce the compute and energy consumed in training a relatively large DNN model, without compromising accuracy. These empirical results make us confident in the potential of the proposed approach to foster a more sustainable way of training DNN models, which is particularly important in Green-AI application contexts.
Organization. The rest of this paper is structured as follows. After providing an overview of existing approaches in Sect. 2, the proposed method Play it straight is illustrated in detail in Sect. 3. The experimental study conducted to evaluate this method, in comparison with previous ones in the field, is then illustrated in Sect. 4. The concluding section finally discusses the main experimental findings and the main contributions of our research work, as well as some of its limitations and future research directions.
2 Related Work
Dataset Reduction for Green AI. The significance of energy conservation has long been recognized [7, 18, 30], leading to ongoing advancements in power consumption estimation methodologies. Alongside these theoretical developments, practical tools for estimating energy consumption have emerged. Traditional training methods use large data volumes in a monolithic training phase, which can be computationally expensive, especially for large datasets and complex models. In general, reducing the size of training data can then be a lever for cutting computational costs and energy consumption, at least linearly in the amount of pruned data. Indeed, dataset reduction methods [14, 29, 32, 33] have gained significant attention in machine learning due to their potential to improve model generalization and reduce computational costs. These methods can be divided into two main categories: selection-based and synthesis-based methods. Selection-based (a.k.a. dataset pruning) methods extract a small sample (or coreset) of the most relevant instances from the original dataset. Synthesis-based (a.k.a. dataset distillation) methods, on the other hand, create a new, smaller dataset by condensing the information from the original dataset. The synthesized data aim to accurately represent the original data distribution, even though they are not taken directly from the original data. For example, a class containing hundreds of images could be condensed into a single, more abstract yet information-rich image.
Dataset Pruning/Sampling for Efficient Training. Since our current research focuses on selecting a subset of the available data instances for training a DNN model, let us focus on the first category of solutions [8]. Quite a wide variety of DP methods have been proposed over the years, featuring different strengths and weaknesses; major representatives include: geometry-based methods like K-Center Greedy [26]; loss/error based methods like Gradient Normed (GraNd) and the Error L2-Norm (EL2N) [19]; gradient matching methods like CRAIG [15] (seeking a data instance subset on which the aggregated model gradients match those on the full dataset); bilevel-optimization methods like GLISTER [31]; and sub-modular optimization methods like GraphCut [10].
However, many of these methods entail heavy computation in order to eventually identify a data sample that is both small and representative/useful enough for subsequently training a DNN model. Even though this cost could be amortized across several model training sessions (possibly including hyperparameter optimization), its energy impact must be fully taken into account.
Notably, recent empirical studies [2, 8, 16] revealed that a pure (uniform) random sampling scheme often allows for learning DNN models with prediction performance quite similar (or even superior) to that of sophisticated Data Pruning/Selection methods, especially in the high-compression regime [2]. Starting from this observation, recent efforts have been made to integrate pure random sampling into neural-network training, so as to achieve a better trade-off between the representativeness and diversity of the sampled data and the efficiency of the training process as a whole. Such a strategy is at the core of the Repeated Sampling of Random Subsets (RS2) method proposed in [16], which essentially consists in randomly sampling a subset of training data for each epoch.
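As a minimal sketch of this per-epoch sampling scheme (function and argument names are illustrative, and the actual RS2 implementation in [16] also supports sampling without replacement across epochs):

```python
import random

def rs2_train(model, dataset, epochs, r, train_one_epoch):
    """Repeated Sampling of Random Subsets (RS2), minimal sketch.

    Each epoch trains on a fresh uniform random sample of round(r * n)
    instances, so one epoch costs roughly a fraction r of a full pass
    over the data. `train_one_epoch(model, subset)` is an assumed,
    caller-supplied optimization step.
    """
    n = len(dataset)
    m = max(1, round(r * n))
    for _ in range(epochs):
        subset = random.sample(dataset, m)  # fresh random subset per epoch
        train_one_epoch(model, subset)
    return model
```

With a reduction factor of, say, \(r=0.3\), each epoch touches only 30% of the training instances, which is what makes RS2 attractive as a cheap warm-up procedure.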
In general, however, extant Data Pruning and Sampling methods (including RS2) suffer from a drawback that may limit their practical value in some real-life applications: the user is required to carefully set a data-selection budget (i.e. a sampling percentage) beforehand in order to eventually achieve a desired level of accuracy. Hence, if this budget is set inadequately, the data-pruning and training processes may need to be repeated as a whole, significantly increasing energy consumption.
Clearly, data pruning is not the only way to reduce model training costs. Other approaches to this task include, for example, modifying the sampling distribution during training, like in [11], where an importance sampling-based algorithm is introduced that accelerates training by exploiting a gradient-norm upper bound. An alternative approach consists in scaling sample losses during training, like in [20], where the SGD optimization scheme is biased towards more important samples, identified after a few training epochs, by sampling them more frequently during the remaining training. In a similar vein, Chang et al. [3] proposed to leverage lightweight estimations of sample uncertainty within SGD: variance in predicted probabilities and proximity to decision thresholds.
Efficient Training via Active Learning (AL). To the best of our knowledge, the work by Salehi et al. [23] was the first study on the application of AL techniques to Green AI contexts, as a possible way to reduce the energy footprint of model training. Park et al. [17] proposed to use an AL framework for data pruning, demonstrating its effectiveness, but at a high energy cost in the absence of optimizations. In this context, some widely used AL approaches, such as uncertainty sampling strategies [27], offer relatively low energy consumption. These strategies let the model select the data points it is most uncertain about for labeling, based on the idea that labeling the most uncertain points provides the most information, improving performance with fewer labeled samples. Common uncertainty sampling criteria include Entropy, Margin and Least Confidence sampling [27]. By contrast, definitely higher computational burdens are introduced by more sophisticated approaches, like the BAIT method proposed in [1], which tries to optimize a bound on the Maximum Likelihood Estimator (MLE) error using Fisher information to guide batch sample selection.
However, using a classical AL-based scheme as done in [23] is unsuitable for Green-AI settings, owing to the high computational cost of repeatedly re-training (possibly up to convergence) a large DNN over increasingly larger data samples.
A Brief Comparison with the Proposed Approach. The method proposed in this paper, named Play it straight, tries to overcome the limitations of all the methods mentioned so far by combining the best of Data Selection and Active Learning: it first exploits an RS2-based scheme to efficiently train (in quite a small number of epochs) a preliminary DNN model from the whole dataset, and then incrementally fine-tunes this model with the help of informative data samples selected in an iterative fashion according to an efficient AL-like criterion. The latter incremental fine-tuning phase is stopped as soon as the model reaches a pre-defined accuracy target, or once the energy budget specified by the user has been consumed (this allows the algorithm to automatically halt as soon as the energy cost incurred surpasses the limit stated by the user). As shown empirically in Sect. 4, the original mix of features of Play it straight mentioned above allows it to achieve remarkable computational savings without sacrificing model accuracy.
3 Proposed Approach
Let \(D \triangleq \{(x_{i},y_{i})\}_{i=1}^{n}\) be a given dataset to be pruned, where each \(x_i \in X\) is a data instance and each \(y_i \in Y\) is the associated class label, represented as a one-hot vector in \([0,1]^C\) (i.e. a vector containing exactly one non-zero element, equal to 1, indicating the class label), with \(C\in \mathbb {N}\) denoting the number of classes. Given a data budget \(b \in \mathbb {N}\), the goal of data pruning techniques is to extract a representative data summary \(D_s \subset D\), with \(\vert D_s\vert \le b\), such that \(\vert D_s\vert \ll \vert D\vert \).
Let \(\mathcal{M}\) be a DNN classification model parameterized by \(\theta \) that needs to be learned, and \(l : Y^2 \times \theta \rightarrow \mathbb {R^+}\) be a continuous loss function that is twice-differentiable in the model parameters \(\theta \). Notably, as \(D\) and \(D_s\) share the same data domain (X), under reasonable system assumptions, training model \(\mathcal{M}\) using gradient descent on \(D_s\) will enjoy a \(\frac{\vert D\vert }{\vert D_s\vert } \times \) speedup compared to training \(\mathcal{M}\) on \(D\).
3.1 Algorithm Play it Straight
Algorithm 1 outlines our proposed approach, dubbed Play it straight. This name reflects our hybrid strategy: we efficiently train (“play”) a DNN model over subsets of informative labeled instances extracted from a given large dataset \(D\) on the basis of “straight” (i.e. easy to calculate, with minor computational overhead) importance scores. In addition to the dataset \(D\), Play it straight takes the following arguments: a neural network model \(\mathcal{M}\); the maximal numbers bootEpcs and ftEpcs of training epochs for the “boot” and “fine-tune” phases, respectively; a maximum energy budget B; a dissimilarity measure \(\textit{d}\) over Y; the number \(k\) of instances to select at each fine-tune round; and an instance ranking function \(f_{rank}\).
The algorithm consists of two phases in a sequence:
1.
Boot phase: First, Play it straight exploits the fast convergence of the RS2 algorithm [16] to find an accurate enough preliminary setting for the parameters of \(\mathcal{M}\) (as shown in [16], using the RS2 algorithm to this end is more effective than performing the same number of optimization steps with a traditional SGD-like procedure). However, as shown in our experimental study, RS2 tends to experience a deceleration in accuracy gains. Thus, Play it straight continues training \(\mathcal{M}\) according to an AL-like procedure, in order to revitalize the training process and eventually achieve better model performance with acceptable energy consumption;
2.
Fine-tune loop: Then, model \(\mathcal{M}\) undergoes an iterative fine-tune procedure, in each round of which \(k\) additional instances are selected from \(D\), based on their importance scores (as explained in the following), and incorporated into \(D_s\). During each fine-tune round, the model is updated and trained using both the newly-added instances and the previously chosen ones. Throughout these rounds, an AL-oriented importance score is assigned to each instance x in \(D_u \triangleq D \setminus D_s\) by using the chosen ranking function \(f_{rank}\), combined with a dissimilarity measurement obtained by applying the given function \(\textit{d}\) (e.g., Euclidean distance or KL divergence) to the model’s output and the one-hot vector representing the ground-truth class of x. This allows us to assign an enhanced importance score (\(score^*\)) to x, which accounts for both the uncertainty and the error associated with the prediction returned for x by the current version of the model being trained. The top-k instances from \(D_u\) are then selected based on these enhanced scores, added to \(D_s\), and removed from \(D_u\). Finally, the model is trained for ftEpcs epochs on the updated \(D_s\). By adopting such an iterative and adaptive process, in place of a one-shot data pruning approach, we can significantly decrease the computation and energy costs, especially in the initial rounds, where quite few data instances are used for model training. On the other hand, since the iterative refinement procedure proposed here can be stopped as soon as the desired accuracy level is achieved, our approach is more flexible than typical data pruning (and coreset selection) methods, which require the user to “guess the right data reduction level” for achieving the desired accuracy while minimizing the computation cost.

Once the loop is completed, the fully-trained version of model \(\mathcal{M}\) is returned.
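The per-round selection step described above can be sketched as follows. All function names here are illustrative, and the additive combination of the uncertainty and error terms is our own assumption: the text only states that the two kinds of scores are combined into \(score^*\).

```python
import numpy as np

def entropy_rank(probs):
    """Example uncertainty-based ranking function f_rank (entropy)."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def l2_dissimilarity(p, y):
    """Example dissimilarity d between a prediction and a one-hot label."""
    return float(np.linalg.norm(p - y))

def select_top_k(probs, labels_onehot, k,
                 f_rank=entropy_rank, d=l2_dissimilarity):
    """One selection step of the fine-tune loop, minimal sketch.

    probs:         (n, C) softmax outputs of the current model on D_u
    labels_onehot: (n, C) ground-truth one-hot label vectors
    Returns the indices of the k instances with the highest enhanced
    score score*, combining uncertainty and prediction error.
    """
    uncertainty = f_rank(probs)                                     # (n,)
    error = np.array([d(p, y) for p, y in zip(probs, labels_onehot)])
    score_star = uncertainty + error            # assumed combination
    return np.argsort(score_star)[::-1][:k]     # top-k indices
```

In the actual algorithm, the returned indices would be moved from \(D_u\) to \(D_s\) before the next ftEpcs training epochs.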
3.2 Setting Guidelines and Implementation Choices
Active learning (AL) offers a pathway to streamline AI model development while aligning with the principles of Green-AI. The core concept lies in the strategic selection of the most informative data samples from a larger labeled dataset. In principle, by using only such a small subset of samples for model training, the total computational costs needed to reach a predefined target accuracy level can be reduced. However, the amount of energy saving that can be obtained strongly depends on the following factors:
Data Reduction Effectiveness: A core measure of AL effectiveness is its ability to drastically reduce the training set size while preserving model performance. The greater the reduction achievable, the higher the potential energy savings;
Data Sampling Complexity: Data sampling methods proposed in AL literature differ a lot in their computational overhead. Simpler methods like uncertainty sampling have minimal cost, while more sophisticated approaches entail heavier compute. Indeed, using some computationally intensive AL technique may render the proposed method ineffective, because the selection process can become more burdensome than the neural network’s training;
Impact on Training Convergence: The interaction between data reduction and the model’s convergence behavior cannot be ignored. In some cases, a highly informative dataset might lead to fewer training iterations, amplifying savings. However, it’s also possible that more iterations might be required to converge, partially offsetting the energy gains.
In the current implementation of the approach, we have considered three alternative uncertainty-based criteria to instantiate the function \(f_{rank}\) attributing importance scores to the data instances, in order to incrementally select a subset of those achieving the highest scores:
Least Confidence (denoted hereinafter as lc, for short): Let p be the probability of the most likely class for a data instance x. Then the least confidence score assigned to x is simply computed as \(1-p\);
Margin sampling (referred to as margin from now on): This criterion focuses on the difference between the probability of the most likely class and that of the second most likely class. If, for a data instance x, \(p_{top1}\) and \(p_{top2}\) are the probabilities of the most likely and second most likely class, respectively, then the margin of x is computed as \(p_{top1} - p_{top2}\); since a small margin indicates high uncertainty, the instances with the lowest margin values are the ones preferred for selection;
Entropy (simply denoted as entropy hereinafter): Entropy measures the overall uncertainty across all classes. A high entropy value indicates the model is unsure about the correct class. For a data instance x, if there are C classes and \(p_i\) is the probability of the i-th class, the entropy is calculated as \(-\sum _{i=1}^C{p_i\log {p_i}}\).
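As a minimal illustration, the three criteria can be implemented over an array of softmax outputs as below. Assuming the convention that instances with the highest scores are selected, the raw margin \(p_{top1} - p_{top2}\) is negated here, since a small margin denotes high uncertainty.

```python
import numpy as np

# Each function maps an (n, C) array of class probabilities to one
# score per instance, where a HIGHER score means a MORE uncertain
# (hence more informative) instance.

def least_confidence(probs):
    # 1 - p of the most likely class
    return 1.0 - probs.max(axis=1)

def margin(probs):
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest probs per row
    return -(top2[:, 1] - top2[:, 0])       # small margin -> high score

def entropy(probs):
    # overall uncertainty across all C classes
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)
```

All three run in a single vectorized pass over the model's outputs, which is what keeps the selection overhead small relative to training.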
4 Experimental Evaluation
4.1 Test Bed
Datasets. We used the following datasets to execute the experimental evaluation:
CIFAR-10 [12]: which consists of 60000 instances representing 32\(\,\times \,\)32 colour images, labeled using 10 mutually exclusive classes, with 6000 images per class. The dataset is organized into a training set of 50000 instances and a test set of 10000 instances. The latter contains 1000 randomly-selected images from each class, while the training set is comprised of 5 training batches that together contain 5000 images per class;
CIFAR-100 [13]: which consists of 60000 instances representing 32\(\,\times \,\)32 colour images, labeled using 100 mutually exclusive classes, with 600 images per class. The dataset is organized into a training set of 50000 instances and a test set of 10000 instances. The latter contains 100 randomly-selected images from each class, while the training set is comprised of 5 training batches that together contain 500 images per class.
Terms of Comparison and Evaluation Setting. We benchmarked Play it straight against standard full-dataset training (referred to hereinafter as Standard train), the RS2 algorithm proposed in [16], the pure AL-based approach presented in [17] and state-of-the-art Data Pruning (DP) methods Glister, GraphCut, CRAIG, GraNd, by leveraging the respective implementations available in library DeepCore [8].
In each test, we evaluated each of these methods by measuring both the total amount of energy (in Wh) it consumed and the accuracy of the models discovered. Inspired by the time-to-accuracy analysis conducted in [16], we fixed different accuracy targets (namely, from 60% to 90% on CIFAR-10, and from 50% to 75% on CIFAR-100) and measured the amount of energy consumed by each method to reach each target, unless the method exhausted its budget of energy/epochs before reaching the target.
All the experiments were run on an Intel Xeon CPU E5-2698 v4 @ 2.20GHz, 250GB RAM, with Tesla V100-DGXS-32GB GPU. Energy measurements were made by using library CodeCarbon, version 2.4.1 [4].
Hyperparameter Configuration. In each test, a ResNet18 [9] classification model was trained using mini-batch Stochastic Gradient Descent (SGD) [21] (with learning rate 0.1 and momentum 0.9) and the Cross-Entropy loss. However, as observed in [16], typical learning rate schedules may not decay sufficiently fast to adequately train the given model in a data-pruning-based machine learning scenario. Thus, as proposed in [16], in each training session we adapted the learning-rate decay schedule to the actual number of optimization steps performed in the session.
We tested Play it straight using three variants of the ranking function \(f_{rank}\) (cf. Algorithm 1), associating each data instance with its margin, entropy and lc (i.e. least confidence) scores, respectively (see Sect. 3.2 for a definition of these scores). However, it is important to note that our approach is flexible and can accommodate other AL techniques or combinations thereof. After testing various dissimilarity measures (including more costly ones, like KL divergence), we eventually decided to report only the results obtained with the L2 distance, as the other measures tested did not appreciably improve on them.
As to the specific configuration of algorithm Play it straight, in the boot phase, the RS2 procedure was always run with a data reduction factor (per epoch) of 30% (i.e. \(r=0.3\)) while fixing a maximum of 20 epochs (i.e. \(\textit{bootEpcs} =20\)); in each fine-tune round, we made Play it straight select 1000 instances for CIFAR-10 (\(k=1000\)) and 5000 instances (\(k=5000\)) for CIFAR-100, and run 10 optimization epochs (\(\textit{ftEpcs} =10\)) for CIFAR-10 and 5 epochs (\(\textit{ftEpcs} =5\)) for CIFAR-100.
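For quick reference, the configuration just described can be collected into a single structure (the dictionary layout is purely illustrative; the values are those reported above):

```python
# Hyperparameters of Play it straight as used in the experiments.
PLAY_IT_STRAIGHT_CONFIG = {
    "boot": {"r": 0.3, "bootEpcs": 20},         # RS2 warm-up phase
    "fine_tune": {                              # per-round settings
        "CIFAR-10":  {"k": 1000, "ftEpcs": 10},
        "CIFAR-100": {"k": 5000, "ftEpcs": 5},
    },
}
```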
The hyperparameters of RS2 [16] and the pure Active Learning (AL) method of [17] were favorably set following the papers in which they were proposed.
Specifically, the AL method was configured to perform 20 AL rounds, and to select 1000 data instances per round based on Margin scores; at each of these AL rounds, the model was re-trained from scratch, for 200 epochs, over all the data instances accumulated up to that moment as done in [17].
Algorithm RS2 was tested with different values of the reduction factor (namely 20%, 10% and 5%), considering a total budget of training epochs of 200, as proposed in [16].
4.2 Test Results
The analysis focuses on three key aspects: computational savings, accuracy, and pruning ratio, comparing the performance of Play it straight with that of the previously described techniques. A significant advantage of Play it straight is its iterative approach to data selection: it eliminates the need to pre-determine the amount of data to prune. Instead, it dynamically adds only the data necessary to reach the target accuracy. The intelligent data selection of Play it straight enables it to reach the target accuracy more quickly, translating into computational savings despite a potentially higher pruning ratio (\(pr\)) compared to other methods.
Table 1. Energy consumption (Wh) required to achieve various target accuracies on the CIFAR-10 dataset using different techniques. Lower values indicate greater energy efficiency. The best method(s) is bolded.

Table 2. Energy consumption (Wh) required to achieve various target accuracies on the CIFAR-100 dataset using different techniques. Lower values indicate greater energy efficiency. The best method(s) is bolded.

Target                       50%    55%    60%    65%    70%    75%
Standard train               101    395    622   1375   1606   1837
AL (margin)                 1057   1304   1870   3265   5000      -
RS2 w/o repl 20%             107    189    210    250    304      -
RS2 w/o repl 10%             112    124    146    167      -      -
RS2 w/o repl 5%               75     80     95      -      -      -
Play it straight (margin)     65     72     94    140    199    418
Play it straight (entropy)    65     72    105    150    212    527
Play it straight (lc)         65     70    103    149    212    362
Fig. 2. Energy-to-accuracy for Play it straight compared to RS2, AL, and standard training of ResNet18 on the full CIFAR-10 dataset, targeting 90% (a) and 80% (b) accuracy. Values are reported every 10 epochs. Subfigure (1) showcases all techniques, while (2) focuses on the low-energy methods.

Fig. 3. Energy-to-accuracy for Play it straight compared to RS2, AL, and standard training of ResNet18 on the full CIFAR-100 dataset, targeting 75% (a) and 70% (b) accuracy. Values are reported every 10 epochs. Subfigure (1) showcases all techniques, while (2) focuses on the low-energy methods.
As can be seen in Tables 1 and 2, Play it straight outperforms the other techniques in terms of computational savings for the same target accuracy, demonstrating the effectiveness of its iterative data selection strategy. This suggests that the “better” data selected by Play it straight not only speeds up the training process but also allows it to reach all the considered target-accuracy levels. Notably, this property is not enjoyed by the other methods analyzed (excluding the Standard train baseline, which performs no data reduction), which fail to meet some of the target accuracy thresholds.
Figure 2 (for CIFAR-10) and Fig. 3 (for CIFAR-100) illustrate the relationship between energy consumption (x-axis) and accuracy (y-axis). These figures clearly demonstrate that our proposed technique can achieve the target accuracy with lower energy consumption than standard training, AL, and RS2. Notably, as dataset complexity increases, the energy savings achieved by our method become more pronounced across both target accuracy levels.
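Energy-to-accuracy curves of this kind pair a cumulative energy reading with the accuracy measured at each checkpoint. The chapter uses CodeCarbon [4] for its energy measurements; the helper below is only an illustrative stand-in that converts wall-clock time into Wh under an assumed constant average power draw (the class name and default wattage are hypothetical):

```python
import time

class EnergyLogger:
    """Illustrative energy-to-accuracy logger, assuming a constant average
    board power; a real setup would read energy from a tool such as CodeCarbon."""

    def __init__(self, avg_power_watts=250.0):
        self.avg_power_watts = avg_power_watts
        self.t0 = time.perf_counter()
        self.curve = []  # list of (cumulative energy in Wh, accuracy) pairs

    def log(self, accuracy):
        """Record one checkpoint, e.g. every 10 epochs as in Figs. 2 and 3."""
        elapsed_h = (time.perf_counter() - self.t0) / 3600.0
        self.curve.append((self.avg_power_watts * elapsed_h, accuracy))

    def energy_to_target(self, target):
        """First recorded energy at which accuracy reaches the target, or None."""
        for wh, acc in self.curve:
            if acc >= target:
                return wh
        return None
```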
We note that while precision, recall, and F1 score were computed for a comprehensive evaluation, they are not detailed here for lack of space, considering that their trends closely mirror those of accuracy. This lets us focus on accuracy as a representative metric without undermining the value of this empirical study.
5 Discussion and Conclusion
Based on our analysis, Play it straight emerges as an efficient method for training DNN models on large datasets. It delivers significant computational savings compared to standard training and AL approaches, without compromising model accuracy. In addition, Play it straight consistently outperforms other data pruning techniques in terms of energy consumption when considering different levels for the target accuracy to achieve. The computational efficiency of Play it straight makes it particularly well-suited for resource-constrained devices and aligns with the goals of Green AI, an increasingly important field in light of the climate crisis. Furthermore, its capacity to handle large datasets expands the potential applications of deep learning models, contributing to more efficient and sustainable systems.
Limitations. While Play it straight demonstrates promising results, it is important to acknowledge some of its limitations. First, the current implementation of Play it straight requires manually setting several hyperparameters, including the numbers bootEpcs and ftEpcs of optimization epochs, the reduction factor r, and the number \(k\) of instances to select at each fine-tune round. Improper choices for these hyperparameters may undermine the energy-saving ability of Play it straight, especially if too many AL rounds are required to reach the target accuracy. Additionally, the choice of the AL-like instance ranking function is critical, as it needs to strike a balance between energy efficiency and effectiveness in data selection. If the selected function is too computationally intensive, it may cancel out part of the energy savings achieved through data reduction.
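Gathering the hyperparameters named above into one structure makes the manual-tuning burden concrete. The defaults below are illustrative placeholders, not the values used in the experiments:

```python
from dataclasses import dataclass

@dataclass
class PlayItStraightConfig:
    """Hyperparameters named in the text; default values are hypothetical."""
    boot_epcs: int = 10      # warm-up (RS2-based) optimization epochs
    ft_epcs: int = 5         # fine-tune epochs per AL-like round
    r: float = 0.5           # reduction factor
    k: int = 1000            # instances selected at each fine-tune round
    target_acc: float = 0.75 # accuracy at which training stops
```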
Future Work. To address these limitations, our future work will focus on several areas. First, we plan to investigate adaptive methods for tuning the above hyperparameters, to alleviate the burden of manual tuning and potentially improve Play it straight's performance across different scenarios. Second, we plan to conduct a comprehensive analysis to establish the boundaries within which Play it straight demonstrates superior performance compared to other techniques. This would provide valuable guidance to practitioners and researchers in selecting the most suitable algorithm for their specific use cases. Finally, we will explore other cheap data selection strategies, combined with model-training acceleration techniques (e.g., based on model pruning, Cutout regularization, or low-precision parameter quantization), in order to further improve Play it straight's energy efficiency and effectiveness. Moreover, we will investigate replay-memory methods (like those commonly used in Reinforcement Learning and Continual Learning) to shrink the amount of previously gathered data in each AL-like round; the Prioritized Experience Replay method proposed by Schaul et al. [25] looks like a promising solution in this perspective.
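The core idea of the proportional variant of Prioritized Experience Replay [25] is to draw instance i with probability proportional to its priority raised to an exponent alpha. A minimal sketch of how it could shrink previously gathered data between rounds (the function name and its use here are our assumptions, not part of the chapter's method):

```python
import numpy as np

def prioritized_subsample(priorities, k, alpha=0.6, seed=0):
    """Proportional prioritized sampling in the spirit of Schaul et al. [25]:
    keep k instances, drawing instance i without replacement with probability
    proportional to priorities[i] ** alpha."""
    rng = np.random.default_rng(seed)
    p = np.asarray(priorities, dtype=float) ** alpha
    return rng.choice(len(p), size=k, replace=False, p=p / p.sum())
```

With alpha = 0 this degenerates to uniform subsampling; larger alpha concentrates the retained set on high-priority (e.g. high-loss or high-uncertainty) instances.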
This work was partially supported by research project FAIR (PE00000013), funded by the EU under the program NextGeneration EU.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
These ranges were chosen differently, starting from the different accuracy scores that the energy-unaware Standard train baseline obtained on the two datasets.
1.
Ash, J.T., Goel, S., Krishnamurthy, A., Kakade, S.M.: Gone fishing: neural active learning with fisher embeddings. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December 2021, virtual, pp. 8927–8939 (2021)
2.
Ayed, F., Hayou, S.: Data pruning and neural scaling laws: fundamental limitations of score-based algorithms (2023)
3.
Chang, H.-S., Learned-Miller, E.G., McCallum, A.: Active bias: training more accurate neural networks by emphasizing high variance samples. In: Neural Information Processing Systems (2017)
4.
Courty, B., et al.: mlco2/codecarbon: v2.4.1 (2024)
5.
de Vries, A.: The growing energy footprint of artificial intelligence. Joule 7(10), 2191–2194 (2023)
6.
Flesca, S., Scala, F., Vocaturo, E., Zumpano, F.: On forecasting non-renewable energy production with uncertainty quantification: a case study of the Italian energy market. Expert Syst. Appl. 200, 116936 (2022)
7.
Garcia-Martin, E., Rodrigues, C.F., Riley, G., Grahn, H.: Estimation of energy consumption in machine learning. J. Parallel Distrib. Comput. 134, 75–88 (2019)
8.
Guo, C., Zhao, B., Bai, Y.: Deepcore: a comprehensive library for coreset selection in deep learning. In: Database and Expert Systems Applications: 33rd International Conference, DEXA 2022, Vienna, Austria, 22–24 August 2022, Proceedings, Part I, pp. 181–195. Springer, Heidelberg (2022)
9.
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
10.
Iyer, R., Khargoankar, N., Bilmes, J., Asanani, H.: Submodular combinatorial information measures with applications in machine learning. In: Feldman, V., Ligett, K., Sabato, S. (eds.) Proceedings of the 32nd International Conference on Algorithmic Learning Theory. Proceedings of Machine Learning Research, vol. 132, pp. 722–754. PMLR, 16–19 Mar 2021 (2021)
11.
Katharopoulos, A., Fleuret, F.: Not all samples are created equal: deep learning with importance sampling. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 2525–2534. PMLR, 10–15 July 2018 (2018)
12.
Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research)
13.
Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-100 (Canadian Institute for Advanced Research)
14.
Loo, N., Hasani, R., Amini, A., Rus, D.: Efficient dataset distillation using random feature approximation. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022)
15.
Mirzasoleiman, B., Bilmes, J., Leskovec, J.: Coresets for data-efficient training of machine learning models. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 6950–6960. PMLR, 13–18 July 2020 (2020)
16.
Okanovic, P., et al.: Repeated random sampling for minimizing the time-to-accuracy of learning. In: The Twelfth International Conference on Learning Representations (2024)
17.
Park, D., Papailiopoulos, D., Lee, K.: Active learning is a strong baseline for data subset selection. In: Has it Trained Yet? NeurIPS 2022 Workshop (2022)
18.
Patterson, D.A., et al.: Carbon emissions and large neural network training. arXiv, abs/2104.10350 (2021)
19.
Paul, M., Ganguli, S., Dziugaite, G.K.: Deep learning on a data diet: finding important examples early in training. In: Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems (2021)
20.
Quercia, A., Morrison, A., Scharr, H., Assent, I.: SGD biased towards early important samples for efficient training. In: 2023 IEEE International Conference on Data Mining (ICDM), pp. 1289–1294 (2023)
21.
Ruder, S.: An overview of gradient descent optimization algorithms (2017)
22.
Sachdeva, N., McAuley, J.J.: Data distillation: a survey. CoRR, abs/2301.04272 (2023)
23.
Salehi, S., Schmeink, A.: Is active learning green? An empirical study. In: 2023 IEEE International Conference on Big Data (BigData), pp. 3823–3829, Los Alamitos, CA, USA. IEEE Computer Society (2023)
24.
Scala, F., Flesca, S., Pontieri, L.: Data filtering for a sustainable model training. In: Proceedings of the 32nd Symposium of Advanced Database Systems, Villasimius, Italy, June 23rd to 26th, 2024. CEUR Workshop Proceedings, vol. 3741, pp. 205–216. CEUR-WS.org (2024)
25.
Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. In: International Conference on Learning Representations (2016)
26.
Sener, O., Savarese, S.: Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489 (2017)
27.
Settles, B.: Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences (2009)
28.
Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. In: Korhonen, A., Traum, D.R., Màrquez, L. (eds.) Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, 28 July–2 August 2019, Volume 1: Long Papers, pp. 3645–3650. Association for Computational Linguistics (2019)
29.
Wang, K., et al.: CAFE: learning to condense dataset by aligning features. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12186–12195, United States (2022)
30.
Xu, J., Zhou, W., Fu, Z., Zhou, H., Li, L.: A survey on green deep learning. arXiv, abs/2111.05193 (2021)
31.
Yang, Z., Yang, H., Majumder, S., Cardoso, J., Gallego, G.: Data pruning can do more: a comprehensive data pruning approach for object re-identification. Trans. Mach. Learn. Res. (2024)
32.
Yu, R., Liu, S., Wang, X.: Dataset distillation: a comprehensive review. IEEE Trans. Pattern Anal. Mach. Intell. 46(01), 150–170 (2024)
33.
Zhao, B., Mopuri, K.R., Bilen, H.: Dataset condensation with gradient matching. In: International Conference on Learning Representations (2021)