Two types of parameters are critical in ML models: (1) the model parameters, which are initialized and updated during the learning process; and (2) the HPs, which cannot be estimated directly from the data and must be set before training an ML model, because they define the model’s architecture [
117]. Understanding which HPs are required for a given task is critical in a variety of scenarios, ranging from experimental design to automated optimization processes. The traditional method, still used in research but requiring knowledge of the ML algorithm’s HP configurations, entails manually tuning the HPs until the desired result is achieved [
11]. This is ineffective in some cases, particularly for complex models with non-linear HP interactions [
118]. Numerous circumstances may necessitate the use of HPO techniques [
119]; we highlight a few of them below, focusing specifically on DFAI tasks.
1.
Conducting a digital forensic investigation requires an inordinate amount of time, and minimizing this time has been a primary research focus in this domain for years. Similarly, machine-driven techniques can be time-consuming, depending on the size of the dataset or the number of HPs. Applying AI techniques to already complicated forensic investigations almost always adds complexity. HPO can significantly reduce the human effort required to tune these HPs, thereby considerably shortening the overall forensic analysis time.
2.
We have already highlighted the importance of performance in the context of DFAI procedures. ML methods require a range of HP settings to obtain optimal performance across a variety of datasets and problems. Numerous HPO techniques exist to help optimize the performance of AI-based models by searching over different optimization spaces in search of the global optimum for a given problem.
3.
As previously stated, reproducibility is a necessary condition for a standard DF technique. HPO can assist in a variety of ways in achieving this goal. When evaluating the efficacy of several AI algorithms on a certain analysis, for example, adopting the same HP settings across all models establishes a fair comparison process. This can also be used to determine the optimal algorithm for a particular problem. Reporting these HP configurations can be advantageous in the event of DFAI model replication.
As with conventional AI models, when developing a DFAI model with HPO in mind, the process will include the following: an estimator (a classifier or regressor) with its objective function, a search (configuration) space, an optimization method for identifying suitable HP combinations, and an evaluation function for comparing the performance of various HP configurations [
118]. A typical HP configuration can be continuous (e.g., multiple learning rate values), discrete (e.g., the number of clusters, k), binary (e.g., whether or not to use early stopping), or categorical (e.g., the type of optimizer), all of which can be combined to produce an optimized model. Because the majority of ML algorithms have well-defined open-source frameworks (such as scikit-learn
12) that can assist in solving problems by tuning (changing the values of) some already pre-set HPs, we will focus on HPO related to DL models, because they require self/auto-tuning of unset parameters. HPs in DL are set and tuned according to the complexity of the dataset and the task, and they are proportional to the number of hidden layers and the number of neurons in each layer [
120]. The initial parameter setting for a DL model is to specify the loss function (binary cross-entropy [
121], multi-class cross-entropy [
122], or squared error loss) appropriate for the problem type. Then comes the type of activation function (e.g., ReLU [
123], sigmoid,
13 etc.) that describes how the weighted sum of the input is transformed into the output. Finally, the optimizer type is specified, which may be stochastic gradient descent (SGD) [
124], Adaptive Moment Estimation (Adam) [
125], or root mean square propagation (RMSprop) [
126]. In what follows, we describe several optimization techniques that can be vital to the optimization of a DFAI model.
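To make these initial choices concrete, the following minimal Python sketch (our illustration, not tied to any specific framework) implements a sigmoid activation and a binary cross-entropy loss in plain Python, and records the optimizer choice as a configuration entry; all names and values here are hypothetical defaults:

```python
import math

# Illustrative stand-ins for the three initial DL choices discussed above:
# a loss function (binary cross-entropy), an activation (sigmoid), and an
# optimizer setting (plain SGD with a learning rate).

def sigmoid(z):
    """Activation: transforms the weighted input sum into a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for binary problems; penalizes confident wrong predictions."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

config = {
    "loss": binary_cross_entropy,              # depends on the problem type
    "activation": sigmoid,                     # output transformation
    "optimizer": {"type": "sgd", "learning_rate": 0.01},
}

print(round(sigmoid(0.0), 3))                  # 0.5
print(round(binary_cross_entropy(1, 0.9), 4))
```

A multi-class problem would swap in a categorical cross-entropy loss and a softmax output in the same fashion.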
4.1 Methods for Hyper-Parameter Optimization in DFAI
A. Trial and error method
This method involves tuning parameters manually. It entails testing a large number of HP values based on experience, guesswork, or analysis of prior results, iteratively improving the parameter guesses until a satisfying result is obtained. This approach may be impractical for a variety of problems, particularly those involving DF analysis, which could involve a large number of HPs or complex models [
118]. However, this technique can improve interpretability by allowing for the assessment of the model’s various working parts as the parameters are tuned.
B. Grid search (GS)
This is a frequently used technique for exploring the HP configuration space [
127]. It does a parallel search of the configuration space and is suitable within a limited search space; otherwise, it may suffer from the “curse of dimensionality” [
129].
When a DF examiner has sufficient knowledge about the (finite) set of HPs to specify [
95] for the search space, GS is preferable. Because computational intensity is one of GS’s drawbacks [
128], its usage in DFAI is mostly focused on comparing the performances of many ML algorithms [
169] in order to identify which one achieves the best performance on a certain forensic task. The authors in [
130] described a botnet detection method using GS optimization techniques.
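As an illustration (not drawn from the cited botnet work), a minimal grid search over a toy objective can be sketched in Python as follows; the HP names and score function are hypothetical stand-ins for a cross-validated model score:

```python
import itertools

# Minimal grid search sketch: exhaustively evaluates every HP combination
# against a toy objective. In practice the objective would be the
# cross-validated performance of the forensic model.

def toy_objective(learning_rate, batch_size):
    # Hypothetical validation score; peaks at lr = 0.1, batch_size = 32.
    return -abs(learning_rate - 0.1) - 0.001 * abs(batch_size - 32)

grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "batch_size": [16, 32, 64],
}

best_score, best_cfg = float("-inf"), None
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    score = toy_objective(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg

print(best_cfg)  # {'learning_rate': 0.1, 'batch_size': 32}
```

Note that the number of evaluations is the product of the grid sizes (here 4 × 3 = 12), which is exactly why GS scales poorly as HPs are added.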
C. Random search (RS)
RS was proposed in [
131] as a way to circumvent GS’s limitations. Unlike GS, RS randomly selects a predefined number of candidate samples between specified upper and lower bounds and trains them until the budget is exhausted or the target accuracy is reached. It does this while allocating resources to the best-performing regions through parallelization [
132].
Due to the simplicity with which RS parallelizes, it is an ideal choice for DFAI tasks involving convolutional networks (CNN) [
133], such as multimedia forensics (e.g., audio and video), image forensics, and so on, in which (low-dimensional) features are mapped from one layer to the next. This method can, however, be time- and memory-intensive. To optimize the process, a batching strategy [
135] is implemented that takes advantage of the batch size and learning rate to reduce training time without compromising performance. In this case, RS may be useful in terms of determining the ideal range of values for these parameters [
134], as just the search space must be specified. Additionally, RS’s use in optimizing multimedia forensics analysis suggests that it may be key for recurrent neural networks (RNN) [
136], although RS has the disadvantage of not taking past results into account during evaluation [
118]. As a result, using RS in recursive tasks such as event reconstruction in DFAI may result in less-than-optimal outcomes.
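The RS loop can be sketched in Python as follows; the bounds, budget, and objective are our hypothetical choices, with the objective standing in for a validation score:

```python
import random

# Random search sketch: samples a fixed budget of candidate configurations
# between specified bounds, instead of enumerating a full grid.

def toy_objective(learning_rate, batch_size):
    # Hypothetical validation score; peaks at lr = 0.1, batch_size = 32.
    return -abs(learning_rate - 0.1) - 0.001 * abs(batch_size - 32)

random.seed(0)
budget = 50  # predefined number of candidate samples, as described above
best_score, best_cfg = float("-inf"), None
for _ in range(budget):
    cfg = {
        "learning_rate": 10 ** random.uniform(-4, 0),  # log-uniform in [1e-4, 1]
        "batch_size": random.choice([16, 32, 64, 128]),
    }
    score = toy_objective(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg

print(best_cfg)
```

Each sample is independent, which is what makes RS trivially parallelizable, and also why it ignores previous results during evaluation.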
D. Gradient descent (GD)
The gradient descent [
137] optimization computes the gradient of variables in order to determine the most promising path to the optimum. Gradient-based optimization techniques converge faster to the local minimum than the previously described techniques, but they are only applicable to continuous HP, such as the learning rate in NN [
138], as other types of HP (e.g., categorical) lack a gradient direction. The application of GD in DFAI approaches is almost ubiquitous, as it is used in virtually all DL models. It is one of the most straightforward optimization architectures to understand and interpret. However, the findings published in [172] demonstrated the occurrence of “catastrophic forgetting” when gradient descent is used, which is particularly problematic for reproduction: when trained on a new task, ML models relying on gradient descent alone may forget what they learned on a previous task. Combining gradient descent with dropout [
172] is therefore recommended.
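The core GD update can be illustrated on a smooth one-dimensional objective; the function, starting point, and learning rate below are our hypothetical choices:

```python
# Gradient descent sketch on f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
# GD repeatedly steps against the gradient toward the (local) minimum;
# the step size is controlled by the learning rate, itself a continuous HP.

def grad(x):
    return 2.0 * (x - 3.0)

x = 0.0    # starting point
lr = 0.1   # learning rate
for _ in range(100):
    x -= lr * grad(x)   # move against the gradient

print(round(x, 4))  # converges close to the minimum at x = 3
```

Because each step needs a gradient, this scheme has no analogue for categorical HPs such as the optimizer type, which is precisely the limitation noted above.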
E. Bayesian Optimization (BO)
BO [
139,
140] is an iterative algorithm that calculates future evaluation points based on prior results. It is a general model for all sorts of global optimization, with the goal of becoming less incorrect as more data is observed [
141]. BO identifies optimal HP combinations quickly, and it is applicable regardless of whether the objective function is stochastic, discrete, continuous, convex, or non-convex. Gaussian process (GP) [
142], Sequential Model-based Algorithm Configuration (SMAC) [
143], and Tree Parzen Estimator (TPE) [
144] are examples of common BO algorithms. BO is especially useful in tools like the Waikato Environment for Knowledge Analysis (WEKA) [
145], an open-source tool with collections of ML and data processing algorithms. Numerous DF analyses methods [
146‐
148] have been proposed or conducted using WEKA, leveraging its robust data mining capabilities and the ability to choose from, or compare, a variety of extensible base learning algorithms for a specific forensic task. Selecting the right algorithm and HPs for optimal performance and accuracy in a WEKA-based DFAI analysis can be challenging. In this case, the desirable properties of BO can aid in choosing the optimal ML method and HP settings that minimize analytical errors.
The works presented in [
149] and [
150] demonstrate how BO can be used (more precisely, with SMAC and TPE) as meta-learning to guide the choice of ML algorithms and HPO settings that outperform conventional selections on a classification task.
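To illustrate the iterative idea behind BO without a full surrogate model, the following heavily simplified, TPE-flavored Python sketch proposes each new evaluation point near the best configuration observed so far; the objective, bounds, and noise scale are our hypothetical choices, and a real analysis would rely on a proper BO library fitting a GP, SMAC, or TPE surrogate:

```python
import random

# Simplified BO-style loop: unlike grid or random search, each new
# evaluation point is chosen using prior results (here, by sampling
# near the best configuration seen so far, with exploration noise).

def objective(lr):
    # Hypothetical validation loss, minimized at lr = 0.1.
    return (lr - 0.1) ** 2

random.seed(1)
# Initial random trials seed the history of (config, loss) observations.
history = [(lr, objective(lr)) for lr in (random.uniform(0, 1) for _ in range(5))]

for _ in range(30):
    history.sort(key=lambda t: t[1])
    good_lr = history[0][0]                     # most promising point so far
    # Propose the next point near the promising region (crude "surrogate").
    candidate = min(max(random.gauss(good_lr, 0.05), 0.0), 1.0)
    history.append((candidate, objective(candidate)))

best_lr, best_loss = min(history, key=lambda t: t[1])
print(round(best_lr, 3))
```

The defining feature, in contrast to RS, is that the history feeds back into where the next candidate is drawn.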
F. Multi-fidelity optimization (MFO)
MFO techniques are frequently used to overcome the time constraints imposed on other HPO techniques by huge configuration spaces and datasets. MFO evaluates practical applications by combining low- and high-fidelity measures [
151]. In low-fidelity evaluation, a relatively small subset is evaluated at low cost but with poor generalization performance, whereas in high-fidelity evaluation, a larger subset is examined at higher cost and with improved generalization performance [
152].
MFO techniques include “bandit-based” [
153] methods, which allocate computational resources to the “best-arm” (most promising) HP settings. Successive halving (SHA) and Hyperband (HB) are the two most commonly used bandit-based algorithms [
152,
154].
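As an illustration of the bandit-based idea, the following Python sketch implements successive halving over a handful of hypothetical learning-rate configurations; the scoring function is a synthetic stand-in for training a configuration under a given budget:

```python
import random

# Successive halving (SHA) sketch: start many configurations on a small
# budget, keep the better half each round, and double the budget for the
# survivors. "Budget" here stands in for a number of training epochs.

def validation_score(cfg, budget):
    # Stand-in for "train config cfg for `budget` epochs and report accuracy";
    # a larger budget yields a less noisy estimate.
    rng = random.Random(hash((cfg, budget)))
    noise = rng.uniform(-1.0, 1.0) / (10 * budget)
    return -abs(cfg - 0.1) + noise

candidates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9]
configs = candidates[:]
budget = 1
while len(configs) > 1:
    ranked = sorted(configs, key=lambda c: validation_score(c, budget), reverse=True)
    configs = ranked[: len(configs) // 2]   # keep the best-performing half
    budget *= 2                              # survivors get a doubled budget

print(configs[0], budget)
```

Hyperband extends this scheme by running several SHA brackets with different trade-offs between the number of configurations and the per-configuration budget.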
The application of MFO techniques to DFAI can be exemplified with transfer learning (TL) [
155], which is the process by which previously stored knowledge is used to solve different but related problems. TL has been deployed in a variety of DFAI methods [
156,
157], most notably on image forensics and detection problems using labeled samples. Thus, low or high fidelity optimization can be helpful for determining the optimal solution depending on the size of the stored knowledge (dataset), the investigative problem, and available computational resources. [
158] describes work on detecting (signature-based and unknown) malware-infected domains from HTTPS traffic, using TL optimized with Hyperband. Additionally, a state-of-the-art HPO technique called Bayesian Optimization Hyperband (BOHB) [
159], which combines BO and HB to maximize the benefits of both, is gaining attention, and it will be interesting to see how DF research employs this promising technique in the future.
G. Metaheuristic algorithms
Metaheuristic algorithms are a popular class of optimization techniques, primarily inspired by biological evolution and genetic mutation. They are capable of solving problems that are non-continuous, non-convex, or non-smooth [
118]. Population-based optimization algorithms (POAs) [
160] are an excellent example of metaheuristic algorithms since they update and evaluate each generation within a population until the global optimum is found. The two most frequently utilized types of POA are genetic algorithms (GA) [
161] and particle swarm optimization (PSO) [
162]. PSO, specifically, is an evolutionary algorithm that functions by allowing a group of particles (swarm) to traverse the search space in a semi-random fashion [
116], while simultaneously discovering the optimal solution through information sharing across swarms.
Network forensics with DL is an ideal use case for PSO, as training such models can be time-consuming, since it requires identifying complex patterns in large amounts of data. Detecting a network intrusion or attack requires iteratively reverse-engineering parser code and network traffic logs, which can be challenging for humans [
163]. The work described in [
163] shows the efficacy of PSO as an instrument to minimize/maximize an objective function and to determine the optimal HPs (such as the number of epochs, learning rate, and batch size) that contribute to the deep forensic model’s AUC accuracy and to the reduction of the false alarm rate.
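The swarm dynamics can be sketched in Python as follows (our illustration, not the cited work’s implementation) for a single continuous HP such as a learning rate; the objective and coefficient values are hypothetical but conventional:

```python
import random

# Minimal particle swarm optimization (PSO) sketch: particles traverse the
# search space semi-randomly, each pulled toward its own best-known position
# and the swarm's best, sharing information at every step.

def objective(x):
    # Hypothetical loss to minimize; optimum at x = 0.1.
    return (x - 0.1) ** 2

random.seed(42)
n_particles, steps = 10, 50
w, c1, c2 = 0.5, 1.5, 1.5             # inertia and attraction coefficients

pos = [random.uniform(0, 1) for _ in range(n_particles)]
vel = [0.0] * n_particles
pbest = pos[:]                         # each particle's best-known position
gbest = min(pos, key=objective)        # swarm's best-known position

for _ in range(steps):
    for i in range(n_particles):
        r1, r2 = random.random(), random.random()
        vel[i] = (w * vel[i]
                  + c1 * r1 * (pbest[i] - pos[i])    # pull toward own best
                  + c2 * r2 * (gbest - pos[i]))      # pull toward swarm best
        pos[i] += vel[i]
        if objective(pos[i]) < objective(pbest[i]):
            pbest[i] = pos[i]
    gbest = min(pbest, key=objective)  # information sharing across the swarm

print(round(gbest, 3))  # converges near the optimum at 0.1
```

In an actual network-forensics setting, `objective` would wrap a full model training run, which is why the parallel evaluation of particles matters.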
Table 2
The comparison of HPO techniques (n denotes the number of HP values and k the number of HPs)

| Technique | Strengths | Limitations | Complexity |
| --- | --- | --- | --- |
| GS | Simple; independent evaluations (parallelization); exhaustive use of the search space | Only effective with categorical HPs; time-consuming; search space grows exponentially with the number of HPs; possible overfitting | \(O(n^{k})\) |
| RS | Effective parallelization; improvement over GS; better with low-dimensional data; reduces overfitting; no tuning except specifying the search space | Less effective with conditional HPs; ignores previous results during evaluation; potential variance since it is random | \(O(n)\) |
| GD | Fast convergence for continuous HPs such as the learning rate | Supports only continuous HPs; detects only a local optimum | \(O(n^{k})\) |
| BO (BO-GP, SMAC, BO-TPE) | Fast convergence for continuous HPs; effective with all types of HPs (in the SMAC and BO-TPE cases); computes mean and variance | Poor parallelization capacity; slow convergence for dimensions \(>1000\); specifying the prior is difficult | \(O(n^{k})\) (BO-GP); \(O(n\log n)\) (SMAC, BO-TPE) |
| HB | Better parallelization | Less effective with conditional HPs; requires subsets with a small budget | \(O(n\log n)\) |
| BOHB | Effective with all types of HPs; better parallelization | Requires subsets with a small budget | \(O(n\log n)\) |
| GA | No initialization required; effective with all types of HPs; produces multiple optimal solutions; possible global optimal solution; large solution space; supports multiple objective functions | Poor parallelization capacity; computational complexity | \(O(n^{2})\) |
| PSO | Better parallelization; effective with all types of HPs; efficient global search; insensitive to scaling of design variables | Initialization required; weak local search | \(O(n\log n)\) |
4.2 General Discussion on HPO in DFAI
It is worth emphasizing that the techniques discussed here are by no means exhaustive in terms of definition, components, and applicability. These few were chosen for their popularity and as a means of briefly discussing optimization techniques in the context of DFAI models. In-depth discussions of HPO are available in [
114,
118]. In general, depending on the size of the data, the complexity of the model (e.g., the number of hidden layers in a neural network (NN) [
164‐
166] or the number of neighbours in a
\(k\)-Nearest Neighbors (KNN) [
167,
168]), and the available computational resources, an HP configuration may lengthen the time required to complete a task. Further along this line, in most cases only a few HPs have a substantial effect on the model’s performance in ML methods [
118]. As such, having many HP configurations exponentially increases the complexity of the search space. With DL, however, HPO techniques require significant resources, particularly when dealing with large datasets. Considering all of these complexities, especially in the context of DFAI, where timeliness, transparency, and interpretability are critical, a well-chosen HPO technique should aid rapid convergence and avoid random results. Moreover, given that DF analyses are case-specific, often distinctive, and have interpretability as a fundamental requirement, decomposing complexity should be a priority. Thus, unless forensic investigators have sufficient computing resources and a working knowledge of the parameter settings for the various HPO techniques, they may choose to use the default HP settings of the major open-source ML libraries, or make use of a simple linear model with reduced complexity where necessary. For a self-defined DNN model, basic HP settings and early-stopping techniques can be considered. Finally, to summarize the various HPO algorithms discussed thus far, Table
2 compares these HPO algorithms and their respective strengths and drawbacks, as adapted from [
118] but extended with additional inputs.
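As a closing illustration of the early-stopping technique suggested above for self-defined DNN models, the following Python sketch halts training once a validation loss stops improving for a fixed number of epochs; the loss curve and patience value are synthetic, for illustration only:

```python
# Early stopping sketch: track the best validation loss and stop once it
# fails to improve for `patience` consecutive epochs, keeping the best model.

val_losses = [0.9, 0.7, 0.55, 0.50, 0.49, 0.495, 0.50, 0.52, 0.55, 0.60]

patience = 2
best_loss, best_epoch, waited = float("inf"), 0, 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, waited = loss, epoch, 0  # improvement: reset
    else:
        waited += 1
        if waited >= patience:
            break  # no improvement for `patience` epochs: stop training

print(best_epoch, best_loss)  # stops early, keeping the epoch-4 model
```

This keeps resource usage bounded without requiring any knowledge of the more elaborate HPO techniques surveyed above.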