Fish fin rays constitute a sophisticated control system for ray-finned fish, facilitating versatile locomotion within complex fluid environments. Despite extensive research on the kinematics and hydrodynamics of fish locomotion, the intricate control strategies in fin-ray actuation remain largely unexplored. While deep reinforcement learning (DRL) has demonstrated potential in managing complex nonlinear dynamics, its trial-and-error nature limits its application to problems involving computationally demanding environmental interactions. This study introduces a cutting-edge off-policy DRL algorithm, interacting with a fluid–structure interaction (FSI) environment to acquire intricate fin-ray control strategies tailored for various propulsive performance objectives. To enhance training efficiency and enable scalable parallelism, an innovative asynchronous parallel training (APT) strategy is proposed, which fully decouples FSI environment interactions from policy/value network optimization. The results demonstrate the success of the proposed method in discovering optimal complex policies for fin-ray actuation control, resulting in superior propulsive performance compared to the optimal sinusoidal actuation function identified through a parametric grid search. The merit and effectiveness of the APT approach are also showcased through a comprehensive comparison with conventional DRL training strategies in numerical experiments on controlling nonlinear dynamics.
1 Introduction
Finned fish demonstrate extraordinary mobility by exploiting the innate flexibility and curvature of their body and fins, in contrast to the majority of man-made watercraft, which rely on propeller-driven propulsion. Through millions of years of evolutionary refinement, finned fish have developed oscillatory locomotion characterized by remarkable propulsion efficiency, maneuverability, and minimal noise generation [1, 2].
Despite extensive research on the kinematics and hydrodynamics of fish swimming over the years [1, 3‐6], the optimal control strategies for fin-ray actuation largely remain elusive, primarily due to the intricate complexities arising from the inherent flexibility and curvature of fish bodies and fins, coupled with their nonlinear interactions with the complex fluid environment.
Understanding these strategies is crucial for the development of bio-inspired soft robotic systems. While advances in computational fluid dynamics (CFD) and hydrodynamic experiments have enabled more detailed investigation into the underlying fluid–structure interaction (FSI) physics [7‐9], several challenges persist in understanding the active control mechanism: (i) the highly nonlinear characteristics of the FSI system make classic linearization-based control methods unsuitable; (ii) the continuous spatiotemporal and actuation parameter spaces result in an extremely high-dimensional control space; and (iii) simulating the interacting physics between flexible structures and complex fluid dynamics carries considerable computational expense.
Deep reinforcement learning (DRL) has emerged as a promising approach for tackling highly non-linear dynamic control problems characterized by high-dimensional state-action spaces, as evidenced by recent advances in the field [10‐14]. DRL leverages deep neural networks (DNNs) as the foundation for the control policy, enabling the agent to learn optimal actions through repeated interactions with the environment. In recent years, DRL has proven effective in managing various complex fluid dynamic systems across diverse scenarios, such as laminar and turbulent flows [15‐21], vortex shedding [15, 17, 18, 22‐24], and fish swimming [25‐33]. DRL’s proficiency in handling high-dimensional control space can be attributed to its ability to learn complex mappings between states and actions through the use of DNNs. Furthermore, modern DRL techniques, such as experience replay and target networks [34, 35], enhance stability and convergence during training, thereby improving its efficacy in addressing challenging control problems.
Despite the potential and initial successes of DRL in managing high-dimensional, non-linear systems, substantial challenges remain due to the high computational costs associated with high-fidelity (HF) simulations, particularly in the context of fluid–structure interactions. The trial-and-error nature of DRL requires a significant number of interactions with the environment, and each interaction involves numerically simulating FSI dynamics, such as in fish fin-ray control, making direct training of a DRL agent prohibitively expensive. Therefore, developing an efficient DRL solution capable of handling these computational demands is crucial for advancing the application of DRL to complex fluid/FSI dynamics.
To reduce the training cost of DRL in controlling fish locomotion or schooling, several studies have explored the utilization of fast surrogate models. This approach allows DRL agents to interact with these approximations, circumventing the necessity for direct training in computationally expensive HF simulated environments. A common practice involves leveraging low-fidelity (LF) numerical simulations, which rely on reduced dimensions and (over)simplified physics, providing a computationally efficient alternative for DRL training. For example, Gazzola et al. [30] employed a pair of vortex dipoles to model swimmers, while Novati et al. [29] utilized a sinusoidal function to describe the swimmer’s body curvature and prescribe the motion, avoiding the need for two-way coupled FSI. Some other studies directly neglected the shape of the swimmers and their influence on the surrounding fluids [25, 36]. Another promising strategy is to actively construct a DNN-based surrogate model for the environment during DRL training, known as model-based reinforcement learning (MBRL). The MBRL approach takes advantage of the fast inference speed of DNN surrogates, allowing numerous interactions with the learned environment. Notably, Liu et al. [37] developed a physics-informed MBRL, introducing physics constraints in the MBRL training, leading to enhanced learning performance.
In LF simulated or DNN-learned environments, the numerous iterations required by DRL become manageable, and the learned policy will be subsequently applied to the target HF environment. For example, Verma et al. [26] utilized a two-dimensional (2D) LF numerical model to train the DRL agent, subsequently applying the learned policy to control a fish-like swimmer in the target HF environment based on three-dimensional (3D) direct numerical simulation (DNS). However, due to the notable differences between the training and target environments, the control policy obtained from the LF environment often falls short of achieving optimal performance in the target environment. While refining the LF-trained DRL agent in the target environment can enhance performance, the overall reduction in training costs, considering both the overhead of LF-based pre-training and subsequent fine-tuning in the target environment, remains a subject of debate. Some other studies have chosen to directly train their DRL agents using real experimental data [18, 32], but this approach proves challenging for studying fish locomotion and swimming, given the impracticality or high difficulty associated with experimentally controlling real fish or manufacturing fish-like soft-body robots. Although previous work [38] illustrates the potential of utilizing DRL for experiment design, the challenges associated with training DRL models in real-world experiments persist. Therefore, direct training of a DRL agent in a computationally expensive HF simulated environment is sometimes necessary, particularly for studying fish fin-ray control involving complex nonlinear FSI dynamics, which are highly sensitive to the actions of the control agent.
To accelerate RL training in computationally demanding environments, a viable and effective approach is to simulate multiple environments concurrently. The success of this strategy has been demonstrated by Rabault et al. [39]. While the overall training time was reduced, indiscriminately running hundreds of environments can be inefficient, especially considering the heterogeneous hardware commonly employed in RL training. Furthermore, Rabault et al. [39] coupled parallel training environments with an on-policy DRL algorithm, Proximal Policy Optimization (PPO) [40]. This method restricts training to interactions based on the current policy network, and its implementation requires the DRL agent to be trained only after all the environments have completed their tasks. These constraints limit the potential advantages of running multiple environments in parallel. Previous studies have sought to mitigate the inefficiency caused by the significant variance in simulation times across concurrently simulated environments by starting new agent-environment interactions in an asynchronous manner [41, 42]. Similar ideas have also emerged in supervised learning, where multiple distributed data sources generate training datasets simultaneously [43]. However, it is important to note that despite simulating multiple environments asynchronously, the training of the policy network remains coupled with the environment simulations. Consequently, the update of the policy networks is contingent on the completion of environment simulations, which leads to suboptimal efficiency in DRL training.
In this work, we propose a novel DRL training strategy, Asynchronous Parallel Training (APT), designed specifically to accelerate off-policy deep reinforcement learning efficiently and stably in computationally demanding environments, such as FSI dynamics for flexible fin-ray propulsion. APT revolves around the core concept of optimizing the utilization of heterogeneous hardware resources by harnessing asynchronous operations between CPUs and GPUs. By eliminating the need for synchronization between these computing units, APT effectively minimizes idle time and alleviates bottlenecks associated with conventional synchronous training approaches. This approach, in turn, significantly enhances overall training efficiency and speed in complex simulation environments, enabling faster convergence of the learning process. We successfully apply the APT method to two fin-ray control tasks: maximizing thrust and maximizing propulsion efficiency, achieving better performance compared to baseline methods. Our results illustrate the potential of APT as an effective solution for complex DRL tasks in computationally demanding scenarios. Additionally, we introduce a transfer learning-inspired technique named Global Searching and Local Fine-tuning (GSLF), designed to improve the performance and stability of DRL agents, particularly in the task of maximizing efficiency. The remainder of this paper is structured as follows. In Sect. 2, we provide a detailed description of the APT-based off-policy DRL methodology, outlining its key components and operation. Section 3 presents the numerical results obtained for both the thrust maximization and efficiency maximization tasks. Further experimental findings on the performance of the APT method, along with an exploration of reward function choices, are discussed in Sect. 4. Finally, Sect. 5 concludes the paper.
2 Methodology
2.1 The simulated FSI environment
In this work, we employ DRL to explore control strategies for fin-ray actuation in fish-fin propulsion within a simulated FSI environment (further validation of the FSI solver can be found in the existing literature [44, 45]). Illustrated in Fig. 1a, the fish fin ray is characterized by a unique bilaminar structure, consisting of the intraray, made primarily of soft tissue, and the bony hemitrichs that encapsulate the intraray. The bilaminar nature of this structure enables real-time control of each ray through antagonistic muscle actuation at the base of the ray, causing a displacement offset of the two hemitrichs. This mechanism allows for the generation of intricate stiffness and curvature variations across the entire fin in space and time.
Fig. 1
Schematics of a the fin-ray deformation with muscle actuation by applying an offset of \(\varepsilon \); b the fin-ray root motions of pitching, plunging, and muscle actuation
In the simulated FSI environment, the muscle actuation is represented by applying the offset of \(\varepsilon = \Delta x/L\) to the root of each hemitrich, where \(\Delta x\) is the root displacement of the hemitrich and \(L = 4~\textrm{cm}\) is the length of the fin ray. The detailed material properties and dimensions of the ray model can be found in [46].
To faithfully replicate the biomechanical dynamics observed in natural fish locomotion, our simulation incorporates prescribed pitching and plunging motions within the fin ray model. As depicted in Fig. 1b, the kinematics of the ray are governed by a synergistic interaction between the root’s pitching-plunging motions and subsequent bending introduced by the root displacement \(\varepsilon \). This kinematic scheme is informed by high-resolution photogrammetric analyses of fish swimming [44], which have revealed that the fin-ray root undergoes periodic motions described by the functions \(\beta (t)\) for pitching and h(t) for plunging, occurring with a 90-degree phase shift,
Fig. 2
Boundary conditions of the flow solver and near body computational grids
$$\begin{aligned} \begin{aligned} \beta (t)&= \beta _0 \sin {(2\pi f t)},\\ h(t)&= h_0 \sin {(2\pi f t + \pi /2)}, \end{aligned} \end{aligned}$$
(1)
where the two constants, \(\beta _0 = 0.392\) and \(h_0 = 0.25\), are derived from the photogrammetry [44]; the beating frequency is denoted as \(f=2\) Hz. The upstream flow velocity is set as \(v=10 \mathrm {{cm\cdot s^{-1}}}\), resulting in a Strouhal number (\(St = 2 h_0 f/v\)) of 0.4, falling within the natural range (\(0.2< St <0.4\)) typically observed in aquatic environments [47]. Additionally, a kinematic viscosity of \(\nu = 1.084\times 10^{-6}~\mathrm {m^2\cdot s^{-1}}\) is chosen, yielding a Reynolds number (\(Re = vL/\nu \)) of 3690, which lies in the moderate range of fish swimming with relatively strong viscous effects [48].
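For readers who wish to verify these nondimensional numbers, the short NumPy sketch below evaluates the prescribed kinematics of Eq. 1 and recomputes St and Re from the quoted parameters; it assumes that the plunge amplitude \(h_0\) is normalized by the ray length L (so the dimensional amplitude is \(h_0 L = 1\) cm), which is consistent with the quoted \(St = 0.4\).

```python
import numpy as np

# Parameters quoted in the text. h0 is assumed to be normalized by the ray
# length L, so the dimensional plunge amplitude is h0 * L = 1 cm.
beta0 = 0.392            # pitching amplitude [rad]
h0 = 0.25                # plunging amplitude (fraction of L, assumed)
f = 2.0                  # beating frequency [Hz]
L = 0.04                 # fin-ray length [m]
v = 0.10                 # upstream velocity [m/s]
nu = 1.084e-6            # kinematic viscosity [m^2/s]

def beta(t):
    """Prescribed pitching motion of Eq. (1)."""
    return beta0 * np.sin(2.0 * np.pi * f * t)

def h(t):
    """Prescribed (dimensional) plunging motion of Eq. (1), 90 deg out of phase."""
    return h0 * L * np.sin(2.0 * np.pi * f * t + np.pi / 2.0)

St = 2.0 * (h0 * L) * f / v    # Strouhal number
Re = v * L / nu                # Reynolds number
print(f"St = {St:.2f}, Re = {Re:.0f}")   # St = 0.40, Re = 3690
```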
In the current study, the simulation environment employs an in-house FSI solver that couples a sharp-interface immersed-boundary-method-based incompressible flow solver with a finite-element-method-based solid dynamics solver [49]. The flow solver incorporates a multi-dimensional ghost-cell methodology, adept at handling the complexities of moving boundaries with second-order accuracy both globally and in proximity to the immersed boundary.
Figure 2 illustrates the computational domain and boundary conditions of the simulated FSI environment. The domain is discretized using a grid of \(112 \times 97\) Cartesian cells with the finest grid size of \(0.038L \times 0.038L\) near the fin-ray to resolve the near-field vortex structures. The left boundary is set as a velocity inlet with an upstream velocity v, while the top and bottom boundaries are treated as moving walls, synchronized with the velocity v. The right boundary is defined with a zero pressure and zero velocity gradient condition. For detailed simulation setup and FSI solver, please refer to [46].
2.2 Deep reinforcement learning
In reinforcement learning (RL), an RL agent is tasked with learning an optimal control strategy or policy \(\pi \) from its experience of interacting with the environment. This learning process involves the agent iteratively interacting with its environment and making decisions (\(\varvec{a}_i\)) at each control step (i) based on its observations (\(\varvec{o}_i\)) of the current environment state (\(\varvec{s}_i\)). After executing a control action (\(\varvec{a}_i\)), the environment returns the new state (\(\varvec{s}_{i+1}\)) and a reward (\(r_i\)), which serves as a feedback signal for the action taken. This interaction process can be mathematically described as follows,
$$\begin{aligned} \varvec{s}_{i+1} = \mathscr {F}\left( \varvec{s}_i, \varvec{a}_i\right) , \quad \varvec{o}_i = f_O\left( \varvec{s}_i\right) , \quad r_i = f_r\left( \varvec{s}_i, \varvec{a}_i\right) , \end{aligned}$$
where \(f_O\) denotes the observation function, \(f_r\) is the reward function, and \(\mathscr {F}\) represents the dynamics of the environment. The objective of the RL agent is to maximize the cumulative reward over an episode, which is typically composed of a sequence of control steps.
In the context of deep reinforcement learning (DRL), the policy \(\pi \) is learned by deep neural networks, formulated as,
$$\begin{aligned} \varvec{a}_i = \pi _{\varvec{\theta }}\left( \varvec{o}_i\right) , \end{aligned}$$
where \(\pi _{\varvec{\theta }}\) symbolizes the policy network parameterized by trainable weights \(\varvec{\theta }\). The training in DRL is an optimization problem aimed at maximizing the expected cumulative reward, expressed as,
$$\begin{aligned} \max _{\varvec{\theta }} R = \mathbb {E}_{\varvec{a}_i \sim \pi _{\varvec{\theta }}}\left[ \sum _{i=0}^{N} \gamma ^{i}\, r_i\right] , \end{aligned}$$
where R is the expected return, and \(\gamma \), the discount factor between 0 and 1, reflects the preference for immediate rewards over future rewards. In our implementation, we adopt \(\gamma =0.99\), aligning with standard practices in DRL [50, 51]. In general, DRL algorithms can be divided into two categories: on-policy and off-policy, based on the source of interaction data used for updating the neural networks. On-policy algorithms rely on data derived from the current policy, whereas off-policy algorithms utilize historical experiences, which typically results in greater sample efficiency. The focus of the proposed method is to enhance the training efficiency of off-policy algorithms further. Off-policy algorithms are characterized by their use of a replay buffer, denoted as \(\mathscr {D}\), to store past interaction experiences \(e_i\). Each interaction experience in this context is a tuple comprising the current state \(\varvec{s}_i\), the action \(\varvec{a}_i\) taken by the RL agent based on this state, the next-step state \(\varvec{s}_{i+1}\) resulting from the action, and the associated reward \(r_i\), formally represented as \(e_i = (\varvec{s}_i, \varvec{a}_i, \varvec{s}_{i+1},r_i)\). The neural networks are then updated using the data accumulated in the replay buffer, as detailed in Algorithm 1.
Algorithm 1
Conventional off-policy reinforcement learning
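Algorithm 1 appears as a figure in the published article; as a rough, minimal Python sketch (not the authors' implementation) of the synchronous off-policy workflow it describes, with `env`, `agent.act`, and `agent.update` as placeholder interfaces, one could write:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past experiences e_i = (s_i, a_i, s_{i+1}, r_i)."""
    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.data.append((s, a, s_next, r))

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def train_off_policy(env, agent, n_steps=10_000, batch_size=256):
    """Conventional synchronous off-policy loop: environment stepping and
    network updates strictly alternate, so the CPU (simulation) and the GPU
    (training) repeatedly wait for each other."""
    buffer = ReplayBuffer()
    s = env.reset()
    for _ in range(n_steps):
        a = agent.act(s)                     # query the current policy network
        s_next, r, done = env.step(a)        # expensive environment (e.g., FSI) step
        buffer.add(s, a, s_next, r)
        if len(buffer.data) >= batch_size:
            agent.update(buffer.sample(batch_size))   # policy/Q-network update
        s = env.reset() if done else s_next
```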
2.3 Enhancing RL training efficiency through asynchronous parallel training
In modern DRL frameworks, the training typically involves a division of labor between CPUs and GPUs. CPUs are generally tasked with simulating the environment dynamics, while GPUs are dedicated to the process of updating neural network parameters. Although these heterogeneous computing units are used, conventional DRL algorithms predominantly adhere to a synchronized operational model, wherein CPUs and GPUs alternate their activities, leading to periods of inactivity as one waits for the other to complete its task, as depicted in Fig. 3a.
Fig. 3
Time consumption schematics of three different RL training strategies
This synchronized approach results in suboptimal utilization of the heterogeneous hardware, introducing significant latency and inefficiencies, thereby limiting the overall performance and leading to the underutilization of computing resources during synchronization periods. Running multiple environments in parallel can enhance CPU utilization for certain tasks, as noted in [39]. However, this method still remains inefficient, particularly when the time cost associated with different environments varies significantly, a common scenario in simulating complex FSI problems, as shown in Fig. 3b. This inefficiency persists even in scenarios where only CPUs are employed for both environment interaction and neural network training, as the processing time is dictated by the slowest environment.
To mitigate these inefficiencies and to leverage the full potential of heterogeneous computing systems in DRL, we introduce the Asynchronous Parallel Training (APT) algorithm.
Algorithm 2
Asynchronous Parallel Training (APT) for off-policy reinforcement learning
APT overhauls the training process by decoupling environment simulation, performed by CPUs, from neural network training, carried out by GPUs. This strategy is depicted in Fig. 3c, where the asynchronous nature of APT is evident: CPUs continuously simulate multiple environment interactions without waiting for GPUs to complete training epochs, and vice versa. This approach enables simultaneous operations, eliminating idle times that previously characterized CPU-GPU interdependence. The asynchronous operation allows for a non-blocking workflow where CPUs can process subsequent environment interactions while GPUs concurrently optimize neural network parameters, leading to a significant reduction in total training time and maximizing the utilization of available computational resources. APT ensures active engagement of CPUs and GPUs, enhancing the training pipeline’s efficiency. The implementation details of APT, which include the scheduling of tasks between CPUs and GPUs, the management of the replay buffer, and the updating protocols for neural networks, are comprehensively detailed in Algorithm 2. In the presented APT framework, environment resets are handled independently of the main training loop, allowing for uninterrupted environment simulations and network training sessions. This is particularly beneficial when dealing with complex FSI simulations, where the computational load can vary significantly. We demonstrate APT’s efficacy using the state-of-the-art off-policy DRL algorithm, Soft Actor-Critic (SAC) [50]. Nonetheless, APT is not exclusive to any specific DRL algorithm; it is generally adaptable to various off-policy algorithms with experience replay mechanisms, such as Deep Deterministic Policy Gradient (DDPG) [52] and Twin Delayed DDPG (TD3) [53].
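As an illustration of the idea only (Algorithm 2 is the authoritative description), the following Python sketch mimics APT with `multiprocessing`: CPU workers keep stepping placeholder environments and stream transitions into a shared queue without ever waiting for the learner, while the learner drains the queue into its replay buffer and performs updates without waiting for any particular environment to finish. `DummyEnv`, the random behavior policy, and `update_networks` are stand-ins for the FSI environment, the periodically refreshed policy copy, and the SAC updates.

```python
import multiprocessing as mp
import random
import time
from collections import deque

class DummyEnv:
    """Placeholder for an expensive FSI environment with uneven step times."""
    def reset(self):
        self.t = 0
        return 0.0

    def step(self, a):
        time.sleep(random.uniform(0.01, 0.05))   # mimic variable simulation cost
        self.t += 1
        return random.random(), random.random(), self.t >= 20   # (s_next, r, done)

def update_networks(batch):
    """Placeholder for one SAC policy/Q gradient update on the GPU."""
    time.sleep(0.005)

def env_worker(experience_q, stop_event):
    """CPU worker: keeps simulating and streaming transitions; it never waits
    for the learner. In the full algorithm it would also hold a periodically
    refreshed copy of the policy network for action selection."""
    env = DummyEnv()
    s = env.reset()
    while not stop_event.is_set():
        a = random.uniform(-1.0, 1.0)            # placeholder behavior policy
        s_next, r, done = env.step(a)
        experience_q.put((s, a, s_next, r))
        s = env.reset() if done else s_next

def learner(experience_q, stop_event, n_updates=500, batch_size=64):
    """Learner: drains whatever experiences have arrived, then updates the
    networks; it never blocks on the slowest environment."""
    buffer = deque(maxlen=100_000)
    updates = 0
    while updates < n_updates:
        while not experience_q.empty():          # non-blocking drain of new data
            buffer.append(experience_q.get())
        if len(buffer) >= batch_size:
            update_networks(random.sample(list(buffer), batch_size))
            updates += 1
        else:
            time.sleep(0.01)                     # only while the buffer warms up
    stop_event.set()

if __name__ == "__main__":
    q, stop = mp.Queue(), mp.Event()
    workers = [mp.Process(target=env_worker, args=(q, stop)) for _ in range(4)]
    for w in workers:
        w.start()
    learner(q, stop)
    for w in workers:                            # workers exit once stop is set;
        w.terminate()                            # terminate() avoids blocking on
        w.join()                                 # unread queue items in this sketch
```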
Fig. 4
Illustration of the control parameter and the observation space for the DRL agent. a The observation vector of the surrounding flow \(\varvec{o}_{flow}\), which includes the stream-wise velocity \(u_x\) probed at the locations indicated by black dots; b the observation vector of fin-ray deformation \(\varvec{o}_{fin}\), comprising the y-coordinates of eight equidistant points along the fish fin ray
3 Numerical experiments and results
3.1 Problem formulation and DRL setting
Observation Space To approximate real-world conditions for a fish or fish-like robot, the RL agent’s observation space is confined to the immediate flow field around the fin. Specifically, the observation space \(\mathbb {O} \subset \mathbb {S}\) is a subset of the full state space \(\mathbb {S}\). Namely, the observed state is composed of the x-direction velocity captured by an array of 104 probes, denoted as \(\varvec{o}_{flow} \in \mathbb {R}^{104}\), depicted in Fig. 4a. It also includes an 8-dimensional state vector \(\varvec{o}_{fin} \in \mathbb {R}^{8}\), describing the deformation status of the fin ray at each control step. The complete observation space is thus given by,
$$\begin{aligned} \varvec{o}_i = \left[ \varvec{o}_{flow},\, \varvec{o}_{fin}\right] \in \mathbb {R}^{112}. \end{aligned}$$
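In code, assembling the observation at a control step is simply a concatenation of the two vectors; a trivial sketch with placeholder probe and fin-node readings is:

```python
import numpy as np

# Placeholder readings at one control step: 104 stream-wise velocity probes
# (Fig. 4a) and the y-coordinates of 8 equidistant points along the fin ray
# (Fig. 4b).
o_flow = np.zeros(104)   # u_x at the probe locations
o_fin = np.zeros(8)      # y-coordinates of the fin-ray points

o = np.concatenate([o_flow, o_fin])   # full observation passed to the agent
assert o.shape == (112,)
```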
Action space The RL agent modulates the root displacement \(\varepsilon \) of the fin ray to achieve the control objective, as illustrated in Fig. 1. Due to realistic considerations, the action \(a_i\in \mathbb {R}\) is subject to the following constraint,
Fig. 5
Left panel: the actions \(a_i\) taken by the DRL agent, the corresponding root displacement \(\varepsilon \), and the accumulated thrust \(F_T\) generated by the DRL-controlled fin ray. Right panel: the time series of thrust generated by the max-thrust DRL agent compared with that obtained by the baseline method during one episode
Fig. 6
The vorticity field in the maximize-thrust case at various control steps during the last 20 control steps (\(i\in [60,80]\)) (a–e) compared with the vorticity field at the last time step controlled by the baseline method (f). The position of the fish-fin ray is also indicated
where i indexes the current control step, and \(n = 50\) is the number of numerical steps in one control step. The action \(a_i\) is evenly distributed across each numerical step within the ith control step,
where \(\alpha _{i}\) represents the incremental displacement at each numerical step. The chosen action \(a_i\) is determined by the policy network \(\pi \) with parameters \(\theta \) and current observation \(\varvec{o}_i\).
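Since the constraint and distribution equations are shown as display math in the original article, the sketch below only illustrates one plausible reading of the text: the action \(a_i\) is split into \(n = 50\) equal increments \(\alpha _i = a_i/n\) that are added to the root displacement at every numerical step. The uniform-increment interpretation is an assumption, not the authors' exact formulation.

```python
import numpy as np

def apply_control_step(eps, a_i, n=50):
    """Distribute one control action a_i uniformly over the n numerical steps
    of a control step, assuming alpha_i = a_i / n is the per-step increment of
    the root displacement eps (an interpretation, not the authors' exact
    constraint)."""
    alpha_i = a_i / n
    eps_trace = eps + alpha_i * np.arange(1, n + 1)   # eps after each numerical step
    return eps_trace[-1], eps_trace

# Example: starting from eps = 0, apply one action of magnitude 0.002.
eps_end, trace = apply_control_step(0.0, 0.002)
print(eps_end)   # 0.002
```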
Neural Network Architecture Our DRL model employs two key neural networks: the policy network, which determines the agent’s actions, and the Q-function network, which estimates the value of action-state pairs. Both networks are constructed as multilayer perceptrons (MLPs) featuring two hidden layers. Each hidden layer is densely populated with 512 neurons, ensuring a robust capacity for learning complex representations of the environment and action spaces. For non-linear transformation within the hidden layers, the Rectified Linear Unit (ReLU) activation function is applied. The output layer of the Q-function network utilizes the identity activation function, providing a direct linear output that correlates with the expected returns of the state-action pairs. On the other hand, the policy network’s output layer employs the hyperbolic tangent (Tanh) activation function. The use of Tanh is particularly crucial as it bounds the output, ensuring that the actions generated by the policy net are confined within the predefined valid range. These architectural choices for the neural networks are designed to balance computational efficiency with the ability to capture the complexity of the control task at hand, ultimately leading to more effective and realistic policy development within the constraints of the modeled environment.
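A PyTorch sketch consistent with this description (two hidden layers of 512 ReLU units, identity output for the Q-network, Tanh output for the policy, with the 112-dimensional observation and scalar action from above) might look as follows; note that the full SAC policy additionally parameterizes a stochastic, squashed-Gaussian action distribution, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 112, 1, 512

def mlp(in_dim, out_dim, out_act):
    """Two hidden layers of 512 ReLU units, as described in the text."""
    return nn.Sequential(
        nn.Linear(in_dim, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, out_dim), out_act(),
    )

# Policy network: Tanh output keeps actions inside the valid (normalized) range.
policy_net = mlp(OBS_DIM, ACT_DIM, out_act=nn.Tanh)

# Q-function network: takes a state-action pair, identity (linear) output.
q_net = mlp(OBS_DIM + ACT_DIM, 1, out_act=nn.Identity)

obs = torch.zeros(1, OBS_DIM)
action = policy_net(obs)                                # in [-1, 1]
q_value = q_net(torch.cat([obs, action], dim=-1))
```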
Episode and control step An episode with a duration \(T = 8~\textrm{s}\) is segmented equally into \(N = 80\) control steps. Each step lasts \(\tau = T/N\) in time and consists of \(n = 50\) numerical steps in order to keep the control frequency within a practical range. Within each episode, two prescribed motions are applied to the fin ray in addition to the root displacement \(\varepsilon \): the translational movement h(t) and the rotational movement \(\beta (t)\) as defined in Eq. 1. One episode corresponds to four cycles of these prescribed motions.
3.2 Baseline control method for comparative analysis
To comprehensively evaluate our DRL strategy, we first formulated a baseline control method for comparative purposes. This baseline leverages the prescribed motions h(t) and \(\beta (t)\), characterized by trigonometric functions. Intuitively, we propose that the optimal control strategy for the fin-ray displacement \(\varepsilon (t)\) might adhere to a sinusoidal pattern,
$$\begin{aligned} \varepsilon (t) = \varepsilon _0 \sin {(2\pi f t + \varphi )}, \end{aligned}$$
(11)
where \(\varepsilon _0\) is the amplitude of displacement, and \(\varphi \) is the phase shift, both of which are pivotal parameters that are posited to significantly impact the propulsive effectiveness of the fin ray. By introducing a sinusoidal control strategy, the high-dimensional spatiotemporal control space can be substantially simplified into a two-parameter sinusoidal function space, enabling the application of traditional optimization techniques, including an exhaustive grid search, to systematically explore the parameter space and identify the parameters that yield optimal propulsion.
In pursuit of identifying the most effective parameter set, we conducted a systematic grid search within the two-dimensional parameter space. The amplitude \(\varepsilon _0\) was varied from 0.0002 to 0.007 in increments of 0.0004, and the phase shift \(\varphi \) was adjusted from 0\(^\circ \) to 315\(^\circ \) at 45\(^\circ \) intervals. This approach resulted in 144 distinct scenarios. The performance of each scenario was evaluated based on critical metrics that reflect the propulsive efficiency and control effectiveness. The scenario exhibiting superior performance was selected as the benchmark for comparison. This carefully optimized sinusoidal control strategy provides a direct and relevant comparison for assessing the advantages brought forth by the DRL-controlled approach, thereby validating the improvements in control strategy derived from DRL optimization. More details about the parametric analysis of the fin-ray actuation in the functional space of sinusoidal movements can be found in our previous work [46].
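The sweep itself is straightforward to script; in the sketch below, `evaluate_episode` stands in for a full FSI episode simulated under the sinusoidal actuation of Eq. 11 (here replaced by a synthetic score purely so the example runs), and the grid reproduces the 144 scenarios quoted above.

```python
import numpy as np

def evaluate_episode(eps0, phi_deg):
    """Stand-in for one FSI episode run under eps(t) = eps0*sin(2*pi*f*t + phi);
    a synthetic smooth score is returned here purely so the sketch executes."""
    return -(eps0 - 0.003) ** 2 - 1e-6 * (phi_deg - 90.0) ** 2

amplitudes = np.arange(0.0002, 0.007 + 1e-9, 0.0004)   # 18 amplitude values
phases_deg = np.arange(0.0, 360.0, 45.0)               # 8 phases: 0, 45, ..., 315
assert amplitudes.size * phases_deg.size == 144        # scenarios quoted above

best_params, best_score = None, -np.inf
for eps0 in amplitudes:
    for phi in phases_deg:
        score = evaluate_episode(eps0, phi)            # e.g., total thrust or efficiency
        if score > best_score:
            best_params, best_score = (eps0, phi), score
print("best (eps0, phi):", best_params)
```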
3.3 Maximize thrust
In the first case, the DRL agent is tasked with learning a control policy \(\pi ^T_\theta \) that maximizes the accumulated thrust \(F_T\) produced over a single episode. The optimization goal is formulated as,
$$\begin{aligned} \max _{a_i \sim \pi _{\theta ^T}} F_T = \sum _{i=1}^{N} F_{T,i}, \quad F_{T,i} = \int _{t_{i-1}}^{t_i} f_T \, \textrm{d}t, \end{aligned}$$
where \(F_{T,i}\) represents the accumulated thrust within each control step, while \(f_T\) represents the instantaneous thrust. N is the total number of control steps within one episode, and \(t_i - t_{i-1} = \tau \) is the time duration of the \(i^\mathrm{{th}}\) control step, where \(\tau \) is a constant. Accordingly, the reward \(r\left( \varvec{s}_i, a_i\right) \) can be defined as the thrust \(F_{T,i}\) at control step i,
$$\begin{aligned} r\left( \varvec{s}_i, a_i\right) = F_{T,i}. \end{aligned}$$
Employing our APT methodology, the SAC algorithm guided the DRL agent to an optimal policy \(\varvec{\pi }^T\), aiming to maximize thrust. The agent reached this optimal policy after \(6\times 10^4\) interactions with the environment, producing a total thrust of \(3.2682\times 10^4\mathrm {N\cdot s}\), nearly double that achieved by the baseline optimal control method, \(1.7519\times 10^4\mathrm {N\cdot s}\).
Figure 5 illustrates the dynamic control behavior of the DRL agent, captured through the actions taken \(a_i\), the root displacement of the fin ray \(\varepsilon _i\), and the resultant cumulative thrust \(F_T\) generated over the course of a single control episode.
As observed in the left panel, the DRL agent consistently selects actions of maximum magnitude across all control steps, leading to a pronounced series of peaks and troughs in the root displacement (\(\varepsilon _i\)) profile, signaling an aggressive control policy tailored for optimized thrust generation. The right panel demonstrates the success of this strategy, as evidenced by the steadily climbing cumulative thrust (\(F_T\)) curve, with only minor perturbations due to the periodic nature of the prescribed translational and rotational motions. This pattern indicates that the DRL agent has learned to effectively mitigate adverse factors and fully exploit the available action space, thereby maximizing thrust output throughout the episode. The consistent upward trajectory of the \(F_T\) curve, particularly when contrasted with the thrust generated by the baseline method, validates the DRL agent’s capability to dynamically adjust and improve its control policy, effectively boosting the total thrust.
Figure 6 presents a sequence of vorticity fields captured at five representative DRL control steps within the last 20 steps of an episode, specifically at the 60th, 65th, 70th, 75th, and final control steps of an episode, as illustrated in panels (a) through (e). These frames reveal the intricacies of the fluid dynamics at play, capturing the heightened activity and interaction within the flow as a result of the DRL agent’s control policy. When these fields are compared to the baseline method’s output at the concluding step, shown in panel (f), the contrast is pronounced. The DRL agent’s approach results in a vorticity field marked by a substantially increased number of vortices, which are arranged much closer together. This close arrangement indicates that the DRL agent effectively manages the spatial distribution of vortices, potentially translating to more effective thrust generation. The dense clustering of vortices may reflect a sophisticated control strategy that adeptly exploits fluid dynamics to optimize propulsion.
Fig. 7
The shape and location trajectory of the fish-fin ray during the last 20 control steps (\(i\in [61,80]\)), controlled by RL (a) compared with the baseline method (b). Darker colors indicate later control steps
The shape and location of the fin ray also reflect the distinction between the RL-controlled episode and the baseline method. Figure 7a shows the shape and location of the fish fin ray within the last 20 control steps controlled by the RL agent. Compared with the baseline method (Fig. 7b), a more complex movement pattern of the fin ray is apparent. In particular, several more densely clustered regions of fin-ray positions in adjacent control steps can be observed in the RL-controlled episode compared to the baseline method, which again indicates that the RL agent is able to generate more vortices, potentially translating to more thrust.
3.4 Maximize propulsion efficiency
In the second case, the DRL agent is expected to find a control policy \(\varvec{\pi }^E\) that maximizes the overall propulsion efficiency \(\eta \) in one episode. The optimization goal can be formulated as:
where p is the instantaneous power consumption introduced by taking the action \(a_i\), while \(P_i\) represents the accumulated power consumption in the ith control step and P denotes the total power consumed by the RL agent in one episode. In contrast to the first case, where the control objective can be easily formulated as a summation, the propulsion efficiency \(\eta \) instead appears as a quotient of the total thrust \(F_T\) and the total power consumption P. It is extremely challenging to accurately approximate the efficiency \(\eta \) using a summation formula consisting of a reward at each control step. To address this issue, we propose a training strategy called “global searching and local fine-tuning” (GSLF, see the detailed explanation in Sect. 4.2), where the training of the RL agent is divided into multiple stages, and in each stage, a different reward function is applied to approximate the control objective \(\max _{a_i \sim \pi _{\theta ^E}}\eta \). In particular, here we divide the training into two stages, and two different reward functions, \(r_{GS}\) and \(r_{LF}\), are applied sequentially. These two reward functions are calculated as:
where \(c_1 = 3\times 10^4\) and \(c_2 = 4\times 10^3\) are hyperparameters related to the environment, while \(c_3 = 1000\) and \(c_4 = 1\) are normalization parameters.
By applying the APT and the GSLF, the RL agent finds an optimal control policy, as depicted in Fig. 8. The left panel shows the control actions performed by the RL agent (\(a_i\)) as well as the root displacement (\(\varepsilon _i\)).
Fig. 8
Left panel: the root displacement \(\varepsilon _i\) and actions \(a_i\) learned by the RL agent. Right panel: propulsion efficiency at each time step (\(\eta _i\)) controlled by the RL agent compared with the highest-efficiency control pattern found by the baseline method during one episode
Unlike the thrust-maximization case, where the RL agent learns an aggressive control policy to max out thrust, here the RL agent learns to take actions of moderate magnitude with smooth transitions between control steps. In particular, in the last 60 control steps (\(i\in [21,80]\)), a strong periodic pattern can be observed in the action (\(a_i\)) curve as well as the root displacement (\(\varepsilon _i\)) trajectory. This pattern, with a frequency closely matching that of the prescribed motions \(\beta (t), h(t)\) (Eq. 1), suggests that the RL agent has learned to strategically adjust its actions to coordinate with the dynamic system. This alignment of frequencies indicates the agent’s capability to enhance propulsion efficiency by responding effectively to its surrounding environment. During the first 20 steps (\(i\in [1,20]\)), by contrast, the environment is transitioning from a stationary state (i.e., the initial condition) to a more periodic state introduced by the motion of the fin ray. Such complex transition behavior is reflected in the right panel of Fig. 8, which shows the history of the propulsion efficiency at each control step, \(\eta _i\), defined as
It is noteworthy that while the thrust might be zero at a single instantaneous time step, such as during cruising, the efficiency is defined over an entire control episode. Over time, even during cruising (i.e., maintaining constant speed), generating sufficient thrust to overcome drag results in a positive value for \(\eta \) rather than zero. However, with random control, \(\eta _i\) is not necessarily positive because both the numerator and the denominator can be either positive or negative. Thus, \(\eta _i = 0 \) should not be interpreted as the “minimum efficiency”; rather, it should be considered the “neutral” efficiency state. During the transitional stage (i.e., \(i\in [1,20]\)), the RL-controlled episode has a significantly higher propulsion efficiency \(\eta _i\) compared to the baseline method, and for most of the control steps, the RL-controlled episode maintains a higher efficiency. Only in the last few steps does the baseline method achieve a slightly higher efficiency. This distinctive efficiency difference indicates that the RL agent is significantly better at controlling a complex transitional dynamic system than the baseline method. Although in the last few steps the propulsion efficiency of the RL-controlled episode is surpassed by the baseline method, resulting in a slightly lower overall efficiency (\(29.23\%\) by RL compared to \(29.84\%\) by the baseline method), the DRL control is still expected to perform significantly better than the baseline method in practice because DRL does not rely on any prior knowledge. In the baseline method, by contrast, we enforce the frequency of the root displacement \(\varepsilon (t)\) to be exactly the same as the frequency of the prescribed motions, which is impossible to achieve in real-world experiments where the other motions of the fin ray cannot be precisely measured or enforced. Besides, the RL agent is only trained to maximize the propulsion efficiency over a certain number of control steps, and it has shown promising performance in the transitional stage, which takes up a significant portion of the overall episode. If the RL agent were trained to control a longer episode in which the transitional stage takes a smaller ratio, we believe DRL would achieve higher efficiency in the periodic stage as well.
Fig. 9
The vorticity field in the maximize propulsion efficiency case at various control steps during the last 20 control steps (\(i\in [60,80]\)) (a–e) compared with the vorticity field at the last time step controlled by the baseline method (f). The position of the fish-fin ray is also indicated
Figure 9a–e presents a sequence of vorticity fields captured at five representative DRL control steps within the last 20 steps of an episode, specifically at the 60th, 65th, 70th, 75th, and the final control step of an episode. These frames reveal the flow field influenced by the DRL-controlled fish fin ray. When these fields are compared to the baseline method’s output at the concluding step, shown in Fig. 9f, the DRL agent’s approach results in a similar vorticity field, which indicates that the DRL agent successfully learns to leverage the dynamics of the fluid environment by adopting a sinusoidal-like control policy that shares a similar frequency with the prescribed motions. The similarity in the distribution of vortices may reflect a sophisticated control strategy that adeptly exploits fluid dynamics to optimize propulsion efficiency.
Fig. 10
The shape and location of the fish-fin ray during the last 20 control steps (\(i\in [61, 80]\)), controlled by RL (a) and the baseline method (b). Each line represents the shape and location of the fin ray in one control step
The similarity between the RL-controlled episode and the baseline method in the maximize-efficiency case can be further verified by Fig. 10, which depicts the shape and location of the fish fin ray in the last 20 control steps. Both the RL-controlled fin ray (Fig. 10a) and the baseline-controlled fin ray (Fig. 10b) share a similar range of trailing-edge amplitude. Despite the overall visual similarity, small distinctions between the RL-controlled and baseline-controlled fin rays can still be observed. In particular, the baseline-controlled fin ray shows a strictly symmetric pattern about \(y=0\), whereas this symmetry is not strictly satisfied for the RL-controlled fin ray. This imperfect symmetry also explains the slightly lower efficiency the RL agent achieved in the last few control steps, and it is related to the transitional stage, during which the RL agent effectively improves the efficiency compared to the baseline method. When compared to the maximizing-thrust case (Fig. 7a), the RL-controlled fin ray shows a significantly different pattern, with the trailing-edge distribution range shrunk by half, indicating that the RL agent effectively learns different strategies to achieve different control objectives.
4 Discussion
4.1 Comparative efficacy of APT with conventional DRL training strategies
Having demonstrated APT’s capability in handling complex FSI problems, we now present a comparative analysis to underscore its advantages. In this section, we compare APT against two conventional RL training strategies: Single Environment Training (ST) and Synchronous Parallel Training (SPT), highlighting the superior sample efficiency and training speed offered by APT.
4.1.1 Testing environment for benchmarking
Direct interaction with high-fidelity FSI simulations is computationally prohibitive for conventional DRL training methods. To facilitate a fair comparison, we employ a one-dimensional chaotic system governed by the Kuramoto–Sivashinsky (KS) equation as a test environment. The KS system is often used as a model problem for turbulence studies due to its chaotic behavior [54]. Here, the KS environment is controlled by four actuators equally distributed in space, with the aim of minimizing energy dissipation and total power input. The governing dynamics are expressed as,
$$\begin{aligned} u_t + u_{xx} + u_{xxxx} + u u_x = f(x,t), \quad x \in [0,l],\ t \in [0,+\infty ), \end{aligned}$$
(18)
where u is the state variable, and f represents the actuator-induced source term. The source term is modeled as a sum of Gaussian functions centered at the actuator locations,
where \(x_i \in \{0,\,l/4,\,l/2,\,3l/4\}\) are the spatial coordinates of the actuators, and \({\varvec{a}}= \{a_i(t)\}_{i=1,2,3,4} \in [-0.5,0.5]^4\) are the control parameters. To minimize the energy dissipation of the system with minimum input power, the reward function is designed as follows,
where T denotes the duration of one control step. The environment is simulated numerically using the finite difference method, where the convection term is discretized by the second-order upwind scheme, and the second and fourth derivatives are discretized by the 6th order central difference scheme. The 4th order Runge–Kutta scheme is used for time integration with a timestep of 0.001 within a spatial domain of \(l=8\pi \) discretized into 64 grid points.
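A compact stand-in for this test environment is sketched below using periodic second-order central differences and RK4 time stepping; the authors' solver uses a second-order upwind convection scheme and 6th-order central differences, and the Gaussian width of the forcing and the number of sub-steps per control step are not specified in the text, so those values here are assumptions.

```python
import numpy as np

l, N, dt = 8 * np.pi, 64, 1e-3
dx = l / N
x = np.arange(N) * dx
x_act = np.array([0.0, l / 4, l / 2, 3 * l / 4])   # actuator locations
sigma = 0.4                                         # Gaussian width (assumed)

def forcing(a):
    """Sum of Gaussian bumps centered at the actuator locations, scaled by the
    control parameters a in [-0.5, 0.5]^4 (periodic distance ignored for brevity)."""
    return sum(a_i * np.exp(-0.5 * ((x - xi) / sigma) ** 2) for a_i, xi in zip(a, x_act))

def rhs(u, f):
    """Kuramoto-Sivashinsky right-hand side with periodic central differences
    (a simplified stand-in for the upwind/6th-order schemes used in the paper)."""
    u_x = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    u_xx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    u_xxxx = (np.roll(u, -2) - 4 * np.roll(u, -1) + 6 * u
              - 4 * np.roll(u, 1) + np.roll(u, 2)) / dx**4
    return -u_xx - u_xxxx - u * u_x + f

def step(u, a, n_sub=100):
    """Advance one control step of n_sub RK4 sub-steps under a constant action a
    (the sub-step count per control step is arbitrary in this sketch)."""
    f = forcing(a)
    for _ in range(n_sub):
        k1 = rhs(u, f)
        k2 = rhs(u + 0.5 * dt * k1, f)
        k3 = rhs(u + 0.5 * dt * k2, f)
        k4 = rhs(u + dt * k3, f)
        u = u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return u

u = 0.1 * np.cos(2 * np.pi * x / l) * (1 + np.sin(2 * np.pi * x / l))   # arbitrary initial state
u = step(u, a=np.zeros(4))
```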
4.1.2 Benchmarking results and insights
Figure 11 compares the performance of the APT, ST, and SPT training strategies within the SAC framework in the KS environment. The performance curves clearly demonstrate APT’s superior efficiency in sample utilization and speed of convergence. When running four parallel environments (APT-4), it exhibits remarkable sample efficiency, achieving optimal policy convergence with fewer than \(1.5\times 10^5\) total interactions with the environment. In contrast, the conventional DRL training strategies (ST and SPT) cannot achieve the optimal policy even after \(5\times 10^5\) interactions.
Fig. 11
Performance and training time analysis across different DRL training strategies in the KS environment. The left panel illustrates the performance curves for APT with 8 and 4 parallel environments, compared with the performance curves for ST and for SPT with 8 and 4 parallel environments. The middle panel presents a linear-scale comparison of the training time required by each method, while the right panel offers a logarithmic-scale perspective, enhancing the visibility of differences in the later stages of training
When utilizing eight parallel environments, the initial phase of APT-8 reveals a quicker ascent in total return, attributed to the increased data availability from the higher number of parallel environments. However, this benefit is transient, as APT-8 eventually shows a slightly diminished sample efficiency, necessitating just under \(2.5\times 10^5\) interactions for convergence. This is because a surplus of parallel environments tends to saturate the replay buffer with outdated data, inadvertently hampering overall sample efficiency. This phenomenon is also observed with the traditional DRL training methods, where scaling up the number of environments in parallel fails to notably enhance sample efficiency. This pattern suggests that the quantity of samples is not the primary constraint; rather, the critical factor impeding the training efficiency of conventional DRL strategies is the suboptimal utilization of the accumulated interaction experiences.
The advantage of APT becomes even more pronounced when examining the training speed. As depicted in the middle panel of Fig. 11, both the APT-4 and APT-8 configurations showcase a rapid initial increase and converge to the optimal policy within \(5\times 10^3\) s, while ST and SPT require more than an order of magnitude longer (\(>5\times 10^4\) s) to achieve comparable levels of performance, as further detailed in the logarithmic scale of the right panel. Introducing more environments in parallel yields only marginal gains in training speed for these conventional methods. In contrast, APT’s asynchronous architecture significantly bolsters both the training speed and sample efficiency by effectively leveraging the already collected dataset.
4.2 Reward formulation for non-additive control objectives
RL intrinsically depends on additive reward functions, yet many control objectives, such as efficiency, are inherently non-additive. For example, efficiency is usually defined as a quotient instead of a summation. This discrepancy necessitates the transformation of non-additive goals into additive reward functions that peak at the same global optimum within the state-action space as the original objective. Identifying such reward functions is often challenging, particularly when the location of the global optimum is unknown, as is typical in RL scenarios. In this section, we discuss how we design additive reward functions that approximate the global optimum for maximizing propulsion efficiency, which is non-additive.
4.2.1 Transitioning non-addable goals to additive rewards
A straightforward additive approximation of efficiency, as defined in discrete terms (see Eq. 14), is,
$$\begin{aligned} r_i = \frac{F_{T,i}}{P_i}, \end{aligned}$$
(21)
yet this form is not an ideal reward function. Apart from failing to align its global maximum with that of the true efficiency \(\eta \), it suffers from instability, particularly when power consumption \(|P_i|\) is minimal. Considering the possible negativity of \(P_i\) in our study, which indicates energy released from the fish fin ray, directly adopting Eq. 21 as a reward function would severely hinder the convergence of DRL training.
To find a stable and accurate approximation, we propose a reward function that captures the incremental change in efficiency caused by each action the DRL agent takes. Accordingly, we derive the following reward function,
Assuming an infinitely long episode allows us to treat cumulative thrust \(\sum _{i=1}^{j-1} F_{T,i}\) and cumulative power \(\sum _{i=1}^{j-1} P_{i}\) as constants. This assumption simplifies Eq. 22 to the following form,
where \(c_1\) and \(c_2\) are constants representing typical values of the thrust generated and power consumed over an entire episode. For practical training, we introduce normalization constants \(c_3\) and \(c_4\), leading to the reward function shown in Eq. 15. This adaptation ensures stability and alignment with the global maximum of the original control objective.
4.2.2 Global searching and local fine-tuning (GSLF)
Finding a universally applicable additive alternative for the efficiency \(\eta \) is very challenging; however, obtaining localized approximations for various regimes is more achievable. In this work, we introduce a global searching and local fine-tuning (GSLF) algorithm using a set \(\mathscr {R}\) of two reward functions, \(r_{GS}\) and \(r_{LF}\), for optimizing the efficiency \(\eta \). Note that the GSLF method is adaptable, capable of incorporating any number of functions sequentially applied during training,
where \(r_{s_i}\) represents the ith reward function used to train the agent; \(\varvec{S}_i\) and \(\varvec{A}_i\) represent the state and action spaces consisting of the trajectories evaluated by the function \(r_{s_i}\), while \(\varvec{\Omega }_i\) is the state-action space composed of \(\varvec{S}_i\) and \(\varvec{A}_i\). Each reward function \(r_{s_i}\) is a “good” approximator within a specific region of the state-action space \(\varvec{\Omega }_i\), avoiding the need to search for a global approximation. The selection of reward functions follows a strategic sequence: the initial reward should enable stable optimization at a global scale, while subsequent functions should increase in accuracy and specificity around the optimal policy within a narrowing state-action space. This staged strategy uses initial rewards for global exploration (global searching), directing the search towards promising regions that may contain the optimal policy, and later rewards for precise optimization within these high-reward zones (local fine-tuning).
The “global searching” reward functions need not perfectly match the global optimum location of the original control goal within the state-action space, but their greater stability and satisfactory global approximation help to limit the DRL agent’s exploration to a smaller, high-reward area. Conversely, “local fine-tuning” functions may be unstable outside their intended high-reward region or exhibit distinct global maxima; nonetheless, they effectively pinpoint the optimal policy within a confined space or further restrict the search for subsequent reward functions. Ideally, each subsequent state-action space is nested within its precursor (\(\varvec{\Omega }_i \subset \varvec{\Omega }_{i-1}\)), though in practice, exploration may extend beyond prior bounds (\(\{\varvec{S}_i,\,\varvec{A}_i\} \not \subset \varvec{\Omega }_{i-1}\)), albeit within a significantly reduced dimensional scope (\(d_{\varvec{\Omega }_i}\) < \(d_{\varvec{\Omega }_{i-1}}\)).
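Operationally, GSLF reduces to swapping the reward function between training stages while keeping the same agent (and replay buffer); the sketch below illustrates this two-stage schedule with placeholder reward callables, since the exact \(r_{GS}\) and \(r_{LF}\) expressions involve the environment-specific constants \(c_1\)–\(c_4\), and `agent`/`env` are placeholder interfaces.

```python
def r_gs(thrust, power):
    """Placeholder global-searching reward: stable everywhere, only roughly
    aligned with the true efficiency objective (exact form uses c1..c4)."""
    return 0.0

def r_lf(thrust, power):
    """Placeholder local fine-tuning reward: accurate near the optimum,
    potentially unstable far from it (exact form uses c1..c4)."""
    return 0.0

def train_gslf(agent, env, schedule):
    """Train sequentially through (reward_fn, n_episodes) stages, reusing the
    same agent and replay buffer so later stages start from the policy (and
    data) produced by earlier, more global stages."""
    for reward_fn, n_episodes in schedule:
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = agent.act(s)
                s_next, info, done = env.step(a)               # info carries thrust/power
                r = reward_fn(info["thrust"], info["power"])   # stage-specific reward
                agent.observe(s, a, s_next, r)
                agent.update()
                s = s_next

# Two-stage GSLF: global searching first, then local fine-tuning, e.g.
# train_gslf(agent, env, schedule=[(r_gs, 400), (r_lf, 400)])
```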
Practically, we deploy GSLF using two chosen reward functions, \(r_{GS}\) for global exploration and \(r_{LF}\) for local optimization, in training an RL agent to discover the optimal policy (\(\varvec{\pi }_E\)) for maximum propulsion efficiency. Here, we use the data collected from the DRL training on a shorter episode containing 20 control steps for better visualization. Figure 12a, b depicts the transition from \(r_{GS}\) to \(r_{LF}\), reflecting a progression from a global, exploratory search to a local, efficiency-optimizing fine-tuning.
Fig. 12
Analysis of the GSLF method in the case of maximizing propulsion efficiency \(\eta \). a, b Normalized return during the training process based on three different reward functions: \(r_{GS}\), \(r_{LF}\), and the efficiency \(\eta \). The dashed parts of the \(r_{GS}\) and \(r_{LF}\) curves indicate that the reward function is only used for evaluation, while the solid parts indicate that the reward function is used for training the RL agent. c The weights of the first 45 principal components of the state and action space of the testing trajectories during the training process at two stages: the pre-training stage and the fine-tuning stage. d The distribution of all the testing trajectories chosen during training in the state and action space. The state and action space is projected onto a two-dimensional space for visualization based on the t-distributed stochastic neighbor embedding (t-SNE) method. The contour is colored based on the efficiency \(\eta \). e–h The testing trajectories (dots) projected onto the two-dimensional t-SNE space. The contours are colored by the normalized reward function \(r_{GS}\) (e), \(r_{LF}\) (f), the efficiency \(\eta \) (g), and \(r_{LF}\) (h), respectively. The color range of the contours is truncated to the high-reward regions. The testing trajectories (dots) are colored by the order in which the RL agent chose them. In panel (h), trajectories from \(\varvec{\Omega }_{LF}\) (orange dots) are added for comparison
Figure 12a shows the performance curve during training. During the first half of the training process, the agent was trained with \(r_{GS}\), followed by \(r_{LF}\) in the latter half. Although the return demonstrated consistent growth throughout the early training stage, the actual propulsion efficiency declined, suggesting that the learned policy was confined within a suboptimal region of the state-action space \(\varvec{\Omega }_{GS}\). This region was characterized by a significant deviation of the maximal efficiency determined by \(r_{GS}\) from the true maximum efficiency \(\eta \). However, upon initiating the fine-tuning phase with \(r_{LF}\), a notable surge in propulsion efficiency \(\eta \) ensued, as \(r_{LF}\) continued to climb gradually. Notably, the return calculated with \(r_{GS}\) maintained a degree of stability across both stages, despite a slight decrease during fine-tuning. This contrasted with the return as measured by \(r_{LF}\) during the initial global search phase, where notable variability underscored its inappropriateness for the initial training stage. However, in the fine-tuning stage, \(r_{LF}\) showcased impressive stability, providing effective guidance towards the optimal policy. It is evident that \(r_{LF}\) offers a closer representation of \(\eta \) in proximity to the optimal policy found in \(\varvec{\Omega }_{LF}\). Yet, completely skipping the global search and exclusively relying on \(r_{LF}\) for training is not feasible, as reflected by Fig. 12b, where the curve illustrates that the training conducted solely with \(r_{LF}\) yielded negligible improvements in efficiency \(\eta \) and was marked by considerable fluctuations, notwithstanding the consistent return as assessed by \(r_{LF}\). This pattern suggests the policy is entrapped within a local maximum of the uniquely \(r_{LF}\)-defined state-action space \(\varvec{\Omega }'_{LF}\), which substantially diverges from the true optimal policy.
The effectiveness of GSLF can be further demonstrated by Fig. 12c–h. Figure 12c shows the weights of the first 45 principal components of the test state-action trajectories during the global searching and local fine-tuning stages, obtained using principal component analysis (PCA). The log-scaled y-axis accentuates the significant reduction in dimensionality from the global searching space \(\varvec{\Omega }_{GS}\) to the fine-tuning space \(\varvec{\Omega }_{LF}\), evidencing the constraining influence of the \(r_{GS}\) reward function. The relationship between the two training stages is further explored in Fig. 12d, where the combined state-action spaces, \(\varvec{\Omega }_{all}\), including both training stages, are projected onto a two-dimensional space \(\varvec{\Omega }_{tSNE}\) using the t-SNE method.
This projection serves to visualize the distribution of test trajectories selected by the DRL agent throughout the training process. It shows that trajectories generated under the same reward function cluster together within this t-SNE transformed space, while those from different reward function stages are markedly separated, except for the initial trajectories, which diverge due to an unrefined policy. The contour in Fig. 12d is colored based on the propulsion efficiency \(\eta \) at these testing points, revealing a multifaceted landscape of efficiency. This landscape is characterized by multiple local maxima, highlighting the complex nature of the policy learning task. Among these trajectories, those belonging to \(\varvec{\Omega }_{LF}\), associated with the fine-tuning stage, are proximal to the regions indicative of an optimal policy. In contrast, trajectories from \(\varvec{\Omega }_{GS}\), representative of the initial global search phase, predominantly occupy regions with lower and smoother efficiency values. Trajectories from \(\varvec{\Omega }'_{LF}\), on the other hand, are found mostly in areas marked by high variability in efficiency. In Fig. 12e–h, the chronologically colored dots represent the testing trajectories within the t-SNE space \(\varvec{\Omega }_{tSNE}\), derived from different stages of the training process: \(\varvec{\Omega }_{GS}\), \(\varvec{\Omega }_{LF}\), \(\varvec{\Omega }_{all}\), and \(\varvec{\Omega }'_{LF}\). The corresponding contours are colored based on the respective reward functions and control goals: \(r_{GS}\), \(r_{LF}\), and \(\eta \). Figure 12e highlights trajectories from \(\varvec{\Omega }_{GS}\) gravitating towards areas with high rewards as per \(r_{GS}\). An orange dashed box delineates this high-reward zone. Figure 12f depicts the early phase of the fine-tuning stage, where the trajectories initially follow a path influenced by \(r_{GS}\), indicated by an orange arrow. As the training progresses, a gradual shift towards the high-reward areas of \(r_{LF}\), closely aligning with regions of high efficiency, becomes evident. Figure 12g displays all the testing trajectories, offering a comprehensive view of the DRL agent’s progression towards the optimal control policy. Figure 12h reveals a bifurcation in the trajectory paths within \(\varvec{\Omega }_{LF}\cup \varvec{\Omega }'_{LF}\), separated by a notable gap, accentuated by an orange dashed box. It is important to note that the apparent high-reward coloration within this gap is a result of linear interpolation and does not accurately reflect the actual reward landscape. This visual discrepancy is clarified by the consistently high return trajectory shown in Fig. 12a, b.
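This type of analysis can be reproduced with standard tooling; the scikit-learn sketch below projects a placeholder matrix of flattened state-action trajectories with PCA (interpreting the "weights" of Fig. 12c as explained-variance ratios, an assumption) and embeds all trajectories jointly with t-SNE as in Fig. 12d–h. The trajectory shapes and counts are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder data: each row is one flattened test trajectory of stacked
# (observation, action) pairs from one training stage (shapes are illustrative).
rng = np.random.default_rng(0)
traj_gs = rng.normal(size=(60, 20 * 113))   # global-searching stage
traj_lf = rng.normal(size=(60, 20 * 113))   # local fine-tuning stage
trajectories = np.vstack([traj_gs, traj_lf])

# Fig. 12c-style analysis: leading principal components per stage.
pca = PCA(n_components=45)
pca.fit(traj_gs)
print(pca.explained_variance_ratio_)        # repeat for traj_lf and compare

# Fig. 12d-h-style analysis: joint 2-D t-SNE embedding of all test trajectories.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(trajectories)
print(embedding.shape)                      # (120, 2)
```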
5 Conclusion and limitation
In this work, we introduced and rigorously evaluated a DRL training approach: asynchronous parallel training (APT). This novel strategy is specifically engineered to expedite the DRL training process, particularly in scenarios where interaction with time-intensive environments, such as high-fidelity FSI simulations, is required. Our application of APT to complex fish fin-ray control tasks demonstrates its exceptional efficacy. In the thrust-maximization scenario, the DRL agent equipped with APT achieved a remarkable \(86.6\%\) increase in thrust generation compared to the baseline method. Further, in the pursuit of maximizing propulsion efficiency, we pioneered the "Global Searching and Local Fine-Tuning" (GSLF) methodology. This approach effectively navigates the challenge of approximating non-additive control goals by employing a series of additive reward functions. The successful implementation of GSLF, in conjunction with APT, results in a control policy that matches the peak efficiency achieved by the baseline method. This outcome not only highlights the practicality of GSLF in complex control scenarios but also its potential in broadening the applicability of DRL in various fields. The merit and effectiveness of the proposed APT method are further discussed by comparing it with conventional DRL training schemes within a chaotic system governed by the KS equation.
Despite the efficiency and effectiveness brought by APT and GSLF during the training of the DRL, we cannot guarantee that the current control policy found by the DRL agent for controlling fish fin ray is globally optimal, especially given the non-convex nature of deep learning methods. There remains the possibility that even better solutions exist beyond the explored state-action space. However, compared to the baseline grid search method, DRL explores the high-dimensional state-action space more effectively, making it a more viable approach for this problem, which would be prohibitively expensive for a grid search to handle.
In conclusion, our study advances DRL training for computationally demanding environments, combining APT and GSLF to efficiently train DRL agents for dynamic control of complex systems. This innovation has broad implications for future DRL applications in science and engineering.
Acknowledgements
The authors would like to acknowledge funding from the Office of Naval Research under award number N00014-23-1-2071 and the National Science Foundation under award number OAC-2047127. The authors would also like to thank the anonymous reviewers for their constructive reviews and comments.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.