Abstract
Model-driven optimization allows domain-specific modeling languages to be applied directly to define models, which are subsequently optimized by applying a predefined set of model transformation rules. Objectives guide the optimization process, which can range from a single objective formulation resulting in one single solution to a set of objectives that necessitates the identification of a Pareto-optimal set of solutions. In recent years, a multitude of reinforcement learning (RL) approaches have been proposed that support both optimization cases, and competitive results have been reported for various problem instances. However, their application to the field of model-driven optimization has not gained much attention yet, especially when compared to the extensive application of meta-heuristic search approaches such as genetic algorithms. Thus, there is a lack of knowledge about the applicability and performance of reinforcement learning for model-driven optimization. We therefore present in this paper a general framework for applying reinforcement learning to model-driven optimization problems. In particular, we show how a catalog of different reinforcement learning algorithms can be integrated with existing model-driven optimization approaches that use a transformation rule application encoding. We exemplify this integration by presenting a dedicated reinforcement learning extension for MOMoT. We build on this tool support and investigate several case studies for validating the applicability of reinforcement learning for model-driven optimization and compare the performance against a genetic algorithm. The results show clear advantages of using RL for single-objective problems, especially for cases where the transformation steps are highly dependent on each other. For multi-objective problems, the results are more diverse and case-specific, which further motivates the usage of model-driven optimization to utilize different approaches to find the best solutions.
Communicated by Javier Troya and Alfonso Pierantonio.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
In the last decade, several approaches for model-driven optimization (MDO) have been proposed, e.g., see [1‐4]. These approaches allow domain-specific modeling languages to be applied directly to define models, which are subsequently optimized by applying a predefined set of model transformation (MT) rules. Different types of objectives can and should be defined for different optimization problems. For some problems, a single objective formulation resulting in one single solution is sufficient. For other problems, e.g., those which involve conflicting objectives, a set of objectives that are treated equally is necessary. Then, not a single solution but a Pareto set of solutions, i.e., solutions which are non-dominated, is reported as the result of running the optimization process. In the existing approaches for MDO, meta-heuristic search algorithms have been the predominant means of realizing MDO [1‐4], either using a rule-based approach, i.e., searching for transformation rule application sequences, or a model-based approach, i.e., directly searching for model states.
In recent years, a plethora of reinforcement learning (RL) approaches has been proposed that support both single-objective and multi-objective optimization [5, 6], providing competitive results for several different problems and differently sized instances [7]. However, their application to the field of MDO is mostly unexplored, especially when compared to the extensive application and evaluation of meta-heuristic search algorithms such as genetic algorithms in this domain. Thus, there is currently a lack of knowledge about the applicability and performance of RL for MDO. Especially for rule-based MDO approaches, encoding transformation rule applications as sequential decision problems and solving them with RL would make it possible to cope with the execution order of rule applications. This is a major challenge in genetic algorithms, e.g., see the crossover operation to build new solutions and the potentially needed repair operations reported in [8]. Furthermore, in contrast to meta-heuristic search algorithms, RL has the potential to utilize the learned experience for future tasks of the same problem domain, e.g., when a problem instance slightly evolves.
In order to shed more light on the potential application of RL for MDO, we present in this paper a general framework for applying RL to MDO problems. In particular, we show how a catalog of different RL approaches exhibiting different properties is integrated with the existing MDO framework MOMoT [3] as an exemplar of MDO approaches following the rule-based encoding approach. This integration makes it possible to run model optimization problems against the catalog of RL approaches, but also to compare them against existing meta-heuristic search approaches, which have traditionally been used for MDO. We build on this tool support and investigate several case studies for validating the usefulness of RL for MDO and compare the performance against existing meta-heuristic approaches.
We have previously investigated [9] the adoption of RL for MDO by conducting first experiments with a classical learning-based approach, the Q-learning algorithm, on single-objective optimization cases. The results demonstrated the feasibility of RL, yet left open questions for its widespread adoption in MDO, such as its effectiveness in multi-objective scenarios. In this work, we build on [9] but extensively expand the work as we adopt several multi-objective approaches in our prototype and examine their optimization capability against well-established meta-heuristic searchers, namely genetic algorithms (GAs). Furthermore, we describe in detail how the process followed in MDO accommodates RL techniques, and how such techniques can complement the established architecture in MDO frameworks. Several RL algorithms are then presented, integrated in MOMoT, and evaluated. The evaluation includes new case studies and an in-depth discussion, further extending the previous work. The extended framework and case study implementations are made publicly available [10]. To sum up, the contributions of this paper are as follows:
A comprehensive conceptual framework integration of RL with rule-based MDO;
A collection of adapted RL algorithms and their hybridization for their efficient application for MDO, integrated into the MDO framework MOMoT;
A rigorous case-study-based evaluation of the different RL algorithms for MDO as well as a comparative study against previously employed meta-heuristic searchers.
The remainder of this work is structured as follows. Section 2 outlines preliminary concepts, thereby introducing MDO approaches as general problem solvers and the MOMoT framework as a dedicated representative. Furthermore, we outline RL fundamentals. Section 3 defines a general formalism for the definition of rule-based MT processes that can be solved with learning techniques. In Section 4, we present a multi-faceted selection of learning algorithms for solving rule-based MT processes, which provide different features for solving diverse MDO problems. Section 5 presents the case-study-based evaluation of the presented approach and a comparative study against genetic algorithms. Finally, Section 6 elaborates on related work, while Section 7 concludes the paper with an outlook on future work.
2 Background
In this section, we present the background for this work. We briefly introduce the cornerstones of MDO and one specific MDO framework used as the basis for this work, present a running example, and summarize the preliminaries on RL required for this work.
2.1 Model-driven optimization
Contemporary literature contains several works examining the synergy of Model-Driven Engineering (MDE) [11, 12] and Search-Based Software Engineering (SBSE) [13] to tackle software engineering problems and beyond. We give an overview on the core elements of existing MDO approaches, then delve into the MOMoT framework as one exemplar we utilize for our studies on learning-based model optimization.
2.1.1 Core elements of model-driven optimization
A common theme in MDO is projecting the initial state of a problem to a model representation and, subsequently, searching for an alternative model that represents a suitable solution state. One key challenge in tackling complex problems by formulating them as optimization problems is to effectively represent the problem on an appropriate abstraction level [14]. In this respect, MDO offers a remedy by utilizing domain-specific models to capture the problem domain and leveraging a generic encoding to explore the design space through MTs in the solution domain. Thereby, it facilitates the integration of the abstraction power of models and the scalability of exploration techniques such as meta-heuristic search, to name just one prominent example. In order to conduct an appropriate search at a particular state, i.e., model instance, the following prerequisites are essential [1, 14, 15]: (i) a representation for solution candidates, (ii) a method to employ search operators to generate new solutions from existing ones, and (iii) a mechanism for evaluating solution candidates. In essence, the same prerequisites apply when employing RL techniques, where a learning paradigm takes the place of the search.
Solution representation The initial problem state corresponds to a model instance based on a metamodel which captures the general problem domain. For the solution encoding, candidates can be represented again as model instances (e.g., see [8, 14, 16‐19]), or as sequences of transformation steps (e.g., [2, 3]) producing the solution models when executed on the initial model.
Solution generation New models emerge from introducing structural or attribute changes as a result of the application of MTs. MTs either create a new output model from scratch (out-place) or are executed in-place to modify the existing input model directly [20]. The direct manipulation is frequently performed for refactoring or optimization purposes [21]. In the case of using a model-based encoding, MTs are invoked to mutate a candidate model, whereas when using a rule-based encoding, i.e., transformation sequences, mutation and crossover operators are used to add, delete, or replace transformation rules and exchange subsequences in order to derive new sequences [8].
Solution evaluation The evaluation of solution candidates provides guidance in navigating towards promising areas in the search space. Generally, optimization problems can be formulated in the generic form [22]:
$$\begin{aligned} \text {minimize} \quad f_i(\varvec{x}) \qquad i = 1, 2, \dots , M, \;\varvec{x} \in R^n \end{aligned}$$
(1)
Functions \(f_i(\varvec{x})\) are called objective functions, each representing one optimization objective. Multi-objective problems involve two or more objectives, i.e., \(M \ge 2\). The decision variables \(x_i\) and their feasible settings denote the search space \(R^n\). As not all solutions in \(R^n\) may be feasible, additional functions \(h_j(\varvec{x})\) may be introduced to restrict the set of all attainable function values, i.e., the solution space, to the feasible region. In the MDO realm, we also rely on the definition of fitness functions to assess a model’s ability to represent a solution for the problem under consideration [14]. These fitness functions are defined on the metamodel in accordance with the modeled problem domain. The search space corresponds to the design space of the model, which includes all models that can be achieved by applying available MTs. In addition, constraint functions or queries can define properties for a model that indicate infeasibility for the intended application context.
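As a minimal sketch of this generic formulation, the following Python snippet evaluates a candidate against a set of objective and constraint functions; the concrete functions `f1`, `f2`, and `h1` are purely illustrative and not part of any MDO framework:

```python
# Sketch of the generic formulation in Eq. (1): a candidate solution x is
# evaluated by M objective functions f_i; constraint functions h_j restrict
# the search space to the feasible region. All names are illustrative.

def f1(x):  # first objective, e.g., total cost
    return sum(x)

def f2(x):  # second objective, e.g., spread between components
    return max(x) - min(x)

def h1(x):  # constraint: all decision variables must be non-negative
    return all(v >= 0 for v in x)

def evaluate(x, objectives=(f1, f2), constraints=(h1,)):
    """Return the objective vector for x, or None if x is infeasible."""
    if not all(h(x) for h in constraints):
        return None  # outside the feasible region
    return tuple(f(x) for f in objectives)

print(evaluate([1, 2, 3]))   # feasible: objective vector (6, 2)
print(evaluate([-1, 2, 3]))  # infeasible: None
```

Infeasible candidates are rejected before the objective vector is computed, mirroring the restriction of the solution space to the feasible region.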
Fig. 1
Illustration of the HV indicator for two objective dimensions. It corresponds to the area covered by the Pareto front, i.e., \(S_1\), \(S_2\), and \(S_3\), in relation to the reference point Ref
In the evaluation process, the fitness values are calculated for a transformed model and represented as a potential solution in the form of an objective vector. These solutions can then be compared using the concept of Pareto optimality [23]. Furthermore, the solution sets of two different algorithms can be evaluated against each other using measures such as the frequently utilized hypervolume (HV) [24] indicator. The HV calculates the volume of the region in the objective space that is dominated by the Pareto-optimal set of solutions (i.e., the Pareto front) found by an algorithm, relative to a given reference point. This is illustrated for a 2-dimensional problem in Fig. 1. A higher HV of a solution set denotes superior quality of the contained solutions (in the MDO context, of the transformed models) concerning the fitness functions.
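The dominance relation and the 2-dimensional HV computation can be sketched as a simple sweep over a mutually non-dominated front for minimization problems; the point coordinates below only echo \(S_1\), \(S_2\), \(S_3\) from Fig. 1 in spirit, the concrete values are made up:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def hypervolume_2d(front, ref):
    """HV of a 2-D front w.r.t. reference point ref (minimization).

    Assumes the front is mutually non-dominated; sums the rectangles
    between consecutive points, sorted by the first objective."""
    pts = sorted(p for p in front if all(x < r for x, r in zip(p, ref)))
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:  # x ascends, y descends along a non-dominated front
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv

front = [(1, 4), (2, 2), (4, 1)]        # three non-dominated solutions
print(hypervolume_2d(front, (5, 5)))    # area dominated w.r.t. (5, 5): 11.0
```

A larger returned value indicates that the front covers more of the objective space relative to the reference point, matching the interpretation given above.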
From an architectural point of view, the constituents of an MDO framework consist of an environment facilitating the generation of metamodels and models, an MT engine and associated language for model manipulation, and algorithms geared towards conducting a search for transformation orchestrations that both optimize the objectives and adhere to designated constraints. Among available tools [1‐4], MOMoT [3], which is explained in more detail in the next subsection, utilizes a rule-based in-place transformation approach and an encoding based on transformations to search for appropriate rule transformation sequences.
2.1.2 MOMoT
MOMoT [3] is a framework that integrates the Eclipse Modeling Framework (EMF) [25] and the Henshin transformation framework [26] with a diverse palette of optimization algorithms, such as evolutionary approaches from the embedded MOEA Framework and a set of local searchers. Moreover, we presented an initial study on extending its scope to learning techniques [9], which is extended by this work. Fig. 2 shows the input artifacts, the resulting output artifacts, as well as the schema used for conducting the search.
The Problem Instance Model represents the problem to solve in its initial state. Henshin [26] is utilized for its Transformation Rule Language and its transformation engine to define and execute MTs. Objectives and Constraints are provided as OCL queries or Java functions having access to the transformed models. The Exploration Configuration concerns search-related settings such as the algorithm, its parameters, and associated performance metrics for the given problem instance. Throughout the Search-based Exploration, new transformation rule sequences are produced, and after application of the rules, the resulting models are subject to an Objective and Constraint Evaluation phase. A set of candidate transformation sequences remains, containing the best trade-off solutions from a decision-making point of view. The final output encompasses the Result Models and Rule Orchestrations from the best candidates, accompanied by their fitness assessment and search-related Exploration Statistics.
Henshin Henshin [26] includes an engine and a language based on the graph-based MT approach [20]. Henshin rules consist of LHS and RHS graphs, which describe the patterns to match in a model and the changes to apply. During execution, a rule is matched against the graph of a model, and the transformation is carried out successfully only if the resulting graph conforms to the specified changes and application conditions as a matter of formal reasoning. Beyond single rule applications, transformation units support a nesting concept and control structures such as loops and conditional branches to manage the control flow.
In the underlying architecture, MOMoT represents solutions as a sequence of transformation units. It thus provides a natural representation for a generic implementation of recombination operators from evolutionary computing [27] but also enables a step-by-step extension of the solution in the sense of a local search. If an error occurs during the execution of the sequence, a repair mechanism handles the error-inducing units to restore feasibility. With the Henshin units used, however, costly repair can be avoided to some extent by using domain-specific constraints.
2.1.3 Motivating and running example
As a motivating and running example for this paper, we consider the static container relocation problem (CRP) [28], also known as block relocation problem [29]. Coming from the logistics domain, container terminals are essential parts of transportation networks with containers being stacked on top of and next to each other in bays while awaiting their retrieval to depart with the intended transportation facility. Naturally, the time to assemble shipment goods affects operation efficiency. Thus, the objective is the efficient retrieval of the containers in the desired order through relocations.
Please note that the CRP may also be natively encoded for RL. However, by following the MDO paradigm, we can explicitly represent the initial problem instances as well as the found solutions for domain experts. In addition, as we will see in the evaluation (Section 5), the metamodel, models, transformation rules, and the formulation of the objectives can be reused for different RL approaches as well as for other optimization approaches, such as the genetic algorithms available in the MOMoT framework.
Modeling the CRP Formally, the CRP consists of containers \(C=(c_1,c_2,\ldots )\) that are distributed among stacks \(S=(s_1,s_2,\ldots )\) with container indices determining the retrieval order. Naturally, a container can and will only be retrieved if it resides at the top of a stack. A relocation moves a container from the top of one stack to the top of another stack. We consider the unrestricted CRP, which permits relocation from any stack at all stages, an unbounded tier size, i.e., maximum stack height, but posing a fixed width for the stacking area, i.e., no new stacks may be added.
The metamodel in Fig. 4 captures the current container arrangement. A corresponding model instance comprises Stack and Container instances and holds a reference to the container to be retrieved next. Each stack is located at a position and stores the container currently residing at its top, if any. Each container resides on one of the stacks until it has been retrieved and may rest on top of another container. The order of retrievals is implicit, as each container refers to the container required next after its own retrieval. This way, the reference from the root element to the container next in line can be updated with each retrieve operation. We note that alternative designs are conceivable.
The transformation model comprises a single top-level unit (Fig. 3b), which contains two units, Retrieve and Relocate. In each transformation step, one of the two is invoked depending on whether the next required container resides on top of a stack (checked with the rule canRetrieveContainer). The Retrieve and Relocate operators, which retrieve and move containers, respectively, in turn consist of several subunits and conditions that keep the transformed model in a consistent state. This includes, for instance, updating the top container of a stack upon retrieval or when moving from it. In this respect, the retrieve/move operation is carried out further down within the Retrieve/Relocate unit, e.g., in the rule retrieveLastFromStack, which is depicted in Fig. 3.
Transformation Goal Determining the optimal moves for retrieval has been proven to be NP-hard [30] even without practical considerations affecting efficiency, e.g., each move takes time depending on the distance between involved stacks. We pick up on this in our goal formulation as we seek move sequences to retrieve as many containers as possible and in the right order (RetrievedContainers), while we minimize the number of moves (TransformationLength) along with the total travel distance (TravelDistance) of all moves. The distance calculation for a move is embedded into the MT rules and based on the difference in position of the corresponding source and target stack. These objectives aim to maximize efficiency and are not necessarily conflicting. Nevertheless, it is assumed that the least required moves often do not imply the shortest travel distances. In many cases, taking one or two additional moves between stacks that are closer together can be faster overall.
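To make the objective vector concrete, the following sketch simulates relocations interleaved with greedy retrievals on a toy bay and reports (RetrievedContainers, TransformationLength, TravelDistance). The list-based encoding and the position-difference distance are simplifications of the modeled CRP, not MOMoT code:

```python
# Minimal sketch of the unrestricted CRP from the running example.
# Stacks are lists (top = last element); containers must be retrieved
# in ascending index order. Names are illustrative only.

def distance(i, j):
    return abs(i - j)  # travel distance ~ difference of stack positions

def run(stacks, moves):
    """Apply (source, target) relocations interleaved with greedy
    retrievals; return (retrieved, number of moves, travel distance)."""
    stacks = [list(s) for s in stacks]
    nxt, retrieved, travel = 1, 0, 0

    def retrieve_all():
        nonlocal nxt, retrieved
        while any(s and s[-1] == nxt for s in stacks):
            for s in stacks:
                if s and s[-1] == nxt:
                    s.pop(); retrieved += 1; nxt += 1
                    break

    retrieve_all()
    for src, tgt in moves:
        stacks[tgt].append(stacks[src].pop())  # relocate top container
        travel += distance(src, tgt)
        retrieve_all()
    return retrieved, len(moves), travel

# Container 1 is buried under container 2 on stack 0; one relocation
# frees it, after which both containers can be retrieved in order:
print(run([[1, 2], []], moves=[(0, 1)]))  # (2, 1, 1)
```

The three returned values correspond directly to the objectives RetrievedContainers (to maximize), TransformationLength, and TravelDistance (both to minimize).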
2.2 Reinforcement learning
RL methods provide a computational approach for solving learning problems through learning from interaction [31]. In contrast to other machine learning paradigms, learning is not based on a training data set, but rather an agent is installed and continually tasked with selecting an action in response to the current state of the environment. In the following introduction, we primarily draw upon the seminal work in this field by Sutton & Barto [31].
2.2.1 General framework
In sequential decision problems, an agent continually interacts with an environment, which exhibits the current problem state. Fig. 5 depicts this interaction within the intrinsic framework of a Markov Decision Process (MDP) [32]. With each action \(A_t\) taken based on the current state \(S_t\) over time steps \(t=0,1,2,\ldots \), the agent receives the successor state \(S_{t+1}\) and a reward \(R_{t+1}\). The agent ought to learn a behavioral pattern, a policy \(\pi \), that dictates the optimal action for each state to maximize cumulative rewards. This policy maps states to the most valuable actions, guiding the agent's decisions and exploration of the environment.
Generally, the agent aims to maximize the cumulative reward \(G_t = R_{t+1} + R_{t+2} + \cdots \) over the (possibly infinite) interaction steps. In continuing tasks, the time horizon for interaction is infinite; hence, earlier rewards should be prioritized to keep the return bounded. Thus, the goal shifts to maximizing the expected discounted return \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3} + \cdots \) with the discount rate \(\gamma \in [0, 1]\).
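For illustration, the discounted return of a finite reward sequence can be computed directly (the reward values and discount rate are chosen arbitrarily):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# 1 + 0.5 + 0.25 + 0.125 = 1.875
print(discounted_return([1, 1, 1, 1], gamma=0.5))
```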
The value function \(V^{\pi }(s)\) is defined in Equation 2 as the expectation of future discounted rewards when starting from a state s and following a policy \(\pi \):
$$\begin{aligned} V^{\pi }(s) = \mathbb {E}_{\pi }\!\left[ G_t \mid S_t = s \right] = \mathbb {E}_{\pi }\!\left[ \sum _{k=0}^{\infty } \gamma ^{k} R_{t+k+1} \,\Big \vert \, S_t = s \right] \end{aligned}$$
(2)
The value of the terminal state in episodic tasks is always zero.
2.2.2 Value-based learning
Value-based methods use the rewards to approximate the value function V(s) through an action-value function Q(s, a), which learns the values of actions in certain states. Accordingly, Q(s, a) in Equation 3 also considers the action a selected to continue from state s following policy \(\pi \):
$$\begin{aligned} Q^{\pi }(s,a) = \mathbb {E}_{\pi }\!\left[ G_t \mid S_t = s, A_t = a \right] \end{aligned}$$
(3)
The goal is to find the optimal policy \(\pi ^*\) that maximizes the optimal value function \(V^*\). This corresponds to the sequence of state-action pairs that maximizes \(Q^{\pi }\), i.e., the optimal action-value function \(Q^*\).
Solving the sequential decision problem means determining \(\pi ^*\) based on the observed state values \(V^{\pi }\). Using the Bellman optimality equation, \(Q^*(s,a)\) can be approximated by iteratively finding the actions that maximize the action-value function, since \(V^{*}(s) = \max _{a} Q^{*}(s,a)\) expresses the value function in terms of the action-value function:
$$\begin{aligned} Q^{*}(s,a) = \mathbb {E}\!\left[ R_{t+1} + \gamma \max _{a'} Q^{*}(S_{t+1}, a') \,\Big \vert \, S_t = s, A_t = a \right] \end{aligned}$$
Solving this equation leads to the value function and therefore to \(\pi ^*\). However, in many practical cases, the state space may be unfathomably large, and methods like Q-Learning [33] are employed to approximate the solution. To this effect, Q(s, a) can be approximated using a tabular approach, i.e., learning an explicit mapping from states to action values, or as a parametric policy function \(\pi _{\Theta }\) using function approximation methods. In this study, however, we investigate the former for in-place MTs.
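A tabular Q-learning sketch on a toy chain environment illustrates the update rule; the environment, the epsilon-greedy strategy, and all hyperparameters below are illustrative and unrelated to the paper's case studies:

```python
import random
from collections import defaultdict

# Tabular Q-learning (Watkins) on a toy chain: states 0..4, actions +1/-1,
# reward 1 only upon reaching state 4. Everything here is illustrative.

ACTIONS = (+1, -1)

def step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4  # next state, reward, done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    Q = defaultdict(float)  # tabular mapping Q[(state, action)] -> value
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda a: Q[(s, a)])
            s2, r, done = step(s, a)
            # Q-learning update: bootstrap with the best next action
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)}
print(policy)  # the greedy policy should move right in every state
```

The learned table approximates \(Q^*\), and extracting the greedy action per state yields the policy \(\pi ^*\), matching the tabular approach investigated in this study.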
2.3 Synopsis
In this section, we discussed MDO as an approach to solving problems through search-based MTs and explicated fundamental concepts of RL. To sum up, MDO attempts to solve a combinatorial optimization problem with a model and MTs, driven by a classical problem definition with fitness functions and constraints. Alternative MTs are generated from other MTs or emerge from randomly adapted MTs. RL methods are aimed at sequential decision problems, which feature an agent that is directed by a reward function. Throughout the learning task, the agent takes a random action or reuses an earlier action previously found beneficial, which is evaluated both on its own and in the context of the subsequent steps.
Adopting the RL framework for MDO introduces novel aspects and nuances that are not present in the conventionally used meta-heuristic searchers. In the MDO context, evolutionary algorithms, particularly GAs, are frequently used. Combined with a rule-based encoding, however, GAs may struggle to achieve fruitful exploration through crossover operations when successive genomes in a chromosome are interdependent [8]. This is prevalent, e.g., with the execution order in rule-based MTs, where a learning agent composes a transformation solution iteratively, with each time step adding another transformation step. Conveniently, this also removes the need for any repair mechanisms addressing non-executable subsequences resulting from crossover or mutation. In addition, common genetic operators require a predefined length for solution candidates, thus limiting the number of rules that constitute the transformation chain. In contrast, an episode in a learning system may take more or fewer steps before termination. Moreover, embedding ineffective placeholder rules to satisfy the solution length constraint is not required. Another downside of meta-heuristic searchers is the volatility of the produced results. Solutions may become obsolete when the problem instance changes, leaving algorithms to run from scratch to find new solutions. Learning-based methods aim to utilize their experience for future tasks of the same problem domain.
RL has already received attention for addressing optimization problems, e.g., for learning heuristics to remove the need for tailored heuristic methods and domain-specific knowledge [7], for preference-based multi-objective optimization without prior setting of preferences [34], and for solving the multi-objective traveling salesman problem [35]. Specifically, the authors of [35] demonstrate that RL approaches can compete with meta-heuristics performance-wise but can also be applied to new problem instances without retraining. Nevertheless, RL may demand extensive resources and memory, whereas meta-heuristic approaches such as GAs are known for fast approximation of good solutions for complex problems. Overall, GAs and RL techniques seem to thrive under different conditions, encouraging research efforts toward their adequate use and potential synergy in hybrid solutions.
Despite the possible positive effects mentioned in our initial study on applying RL for MDO [9], we note that RL methods have not been investigated further for the exploration of MTs in the current state of the art. Although other works have treated certain optimization tasks by applying RL methods, we are not aware of a generic formalism that aligns the formulation of an RL problem with the elements of a rule-based MDO approach. Therefore, we put forth such a formalism, which subsequently enables us to adopt various RL methods for generic optimization in the MDO framework. We demonstrate this in the form of an extension of the MOMoT framework [3], utilizing the existing solution encoding and transformation approaches for MDO. This allows for setting up a comprehensive toolbox that supports the use of different existing meta-heuristic searchers as well as RL approaches for the same problem definition.
3 MDP-ROMT: A framework for learning-based MDO
In this section, we present how MTs can be formulated as sequential decision problems so that RL approaches become applicable. We also demonstrate the application of this formulation to our running example. Finally, we show how this formulation is embedded in the existing MDO architecture of MOMoT.
3.1 Learning-based model transformations as sequential decision problems
All methods belonging to the field of RL apply to sequential decision problems [31], which can be modeled with MDPs [32]. To enable learning techniques for MDO, we adapt this formalism for solving optimization tasks through orchestrating MT rule applications. Therefore, we first state the general model used to formalize RL tasks and then derive a formalism that maps the corresponding concepts of that model to the artifacts used to search for problem solutions in the MDO sphere.
An MDP [32] formally defines a sequential decision problem as a tuple (S, A, T, R), where:
S is the set of states in an environment.
A(s) is the set of actions executable in a given state s.
\(T(s, a, s')\) is the transition function and denotes the probability of transitioning to state \(s'\) when executing a in s. In Markovian environments, the probability distribution over possible successor states only depends on the current state and action, i.e., \(P(s' \mid s,a)\) = \(T(s, a, s')\).
\(R(s, a, s')\) is the reward function to assess the benefit of performing action a in state s and ending up in state \(s'\).
Extending the MDP to multiple objectives, the sequential decision process is subject to multiple criteria, where the reward is expressed as a reward vector \(\varvec{r}(s,a) = \left( r_1(s,a), r_2(s,a), \ldots \right) \). The agent's goal then is to choose actions that yield a good trade-off between the objectives.
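A common way to reduce such a reward vector to a scalar is linear scalarization with one weight per objective, so that single-objective methods remain applicable; the weights and reward values below are illustrative:

```python
# Linear scalarization sketch: a weighted sum collapses the reward vector
# r(s, a) into a single scalar. The weights encode the trade-off between
# objectives; all concrete values here are made up.

def scalarize(reward_vector, weights):
    assert len(reward_vector) == len(weights)
    return sum(w * r for w, r in zip(weights, reward_vector))

r = (3.0, -1.0)                  # e.g., (containers retrieved, -distance)
print(scalarize(r, (0.5, 0.5)))  # 0.5*3.0 + 0.5*(-1.0) = 1.0
```

Different weight vectors steer the agent toward different regions of the trade-off space, which is the basic idea behind the scalarization approaches evaluated later in the paper.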
This framework can be adapted in the context of MDO as an amalgamation of modeling artifacts and a common definition of optimization problems. Specifically, we can formulate rule-based MTs as MDP such that we sequentially select rule applications that produce a model with high fitness for defined objectives. We define an MDP-Based Rule Orchestrated MT (MDP-ROMT) formally as a tuple \((S, s_0, A, A_P, R, T)\), where:
S is the set of model states observable by the agent as a result of rule applications.
\(s_0\) is the initial model instance.
A(s) covers all rule applications that can be executed on model \(s \in S\). A rule application denotes a rule a together with an assignment of the parameters \(A_P(a)\) that are defined for this rule.
\(A_P(a)\) designates the settable parameters that specify the application for a rule a to a current state s. Therefore, these parameters condition the assignment of a value to specify the transformation before execution.
\(R(s, a, s')\) defines the reward for executing rule a with assigned parameter values on the current model s and obtaining result model \(s'\). Following a notation of state-based rewards, \(R(s, a, s')\) corresponds to the objective value observed on the result model \(s'\) after executing a in s.
T denotes the transformation termination criterion. For example, a maximum transformation length could be defined that limits the transformation steps before the agent starts a new learning episode from the initial model \(s_0\).
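One way to realize the MDP-ROMT tuple in code is an environment that delegates rule application to a transformation engine. The sketch below assumes a hypothetical `engine` object offering `applicable(model)` and `apply(model, rule_app)`; this interface is a placeholder, not MOMoT's actual API.

```python
class MDPROMTEnv:
    """Sketch of an MDP-ROMT environment. `engine` is a placeholder expected
    to offer `applicable(model)` -> list of rule applications and
    `apply(model, rule_app)` -> transformed model (deterministic)."""

    def __init__(self, s0, engine, reward_fn, max_steps):
        self.s0 = s0                  # initial model instance
        self.engine = engine
        self.reward_fn = reward_fn    # R(s, a, s')
        self.max_steps = max_steps    # termination criterion T
        self.reset()

    def reset(self):
        """Start a new learning episode from the initial model."""
        self.state = self.s0
        self.steps = 0
        return self.state

    def actions(self):
        return self.engine.applicable(self.state)   # A(s)

    def step(self, rule_app):
        """Apply one rule, return the result model, reward, and termination flag."""
        s_next = self.engine.apply(self.state, rule_app)
        r = self.reward_fn(self.state, rule_app, s_next)
        self.steps += 1
        done = self.steps >= self.max_steps         # T: maximum length reached
        self.state = s_next
        return s_next, r, done
```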
The integration of the principal RL system with the constituents of a rule-based MDO framework is depicted in Fig. 6. The agent interacts with the environment by selecting a rule a with specified values for parameters according to \(A_P(a)\). The environment responds with the transformed model \(s'\) after applying a on the previous model state and the reward r, which assesses the objective development based on the transformed model \(s'\). To find an optimized target state of a model, the agent learns to choose rules successively that, when executed on the initial model, produce the result models that optimize the target objectives. Accordingly, the result of running an MDP-ROMT is a policy \(\pi \), which returns the transformations for \(s_0\) that induce output models with high quality regarding the objectives.
Compared to an MDP, in an MDP-ROMT states are embodied by models, and actions are replaced with transformations in the form of rule applications. The termination criterion T ensures that the transformation process is finite. In this regard, shorter transformation sequences are commonly preferred, e.g., as we aim to minimize the operations needed to retrieve all containers in our running example. The initial model, \(s_0\), denotes the initial state of the environment. Note the absence of the MDP transition function since we assume deterministic transformation outcomes, i.e., \(T(s,a,s') = 1\). In other words, the target model \(s'\) is always identical after applying rule a with the same parameter values on a model s. Furthermore, the Markov property inherent to MDPs still holds for MDP-ROMTs. That is, given the current model state and a rule application, the model after transformation only depends on the current model and the executed rule application.
The reward function should indicate how the optimization targets develop as the model evolves. Naturally, the objective fitness functions for model evaluation can be leveraged. The reward following a rule application on model s can be determined as the fitness delta between s and the successor model \(s'\).
Following Equation 5, for a maximization objective, an increasing objective value in the evolved model \(s'\) after rule application a yields a positive reward whereas a decrease is penalized with a negative reward to discourage such transformations. Conversely, finding higher objective values for \(s'\) is penalized and a decrease is rewarded for minimization targets.
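The delta reward described here can be sketched in a few lines. The function name and the `fitness` callback are illustrative; the sign convention follows the text, with improvements rewarded and regressions penalized, and minimization objectives negated.

```python
def delta_reward(fitness, s, s_next, maximize=True):
    """Reward as the fitness delta between successor model s_next and model s.
    For minimization objectives the sign is flipped, so a decreasing
    objective value yields a positive reward."""
    delta = fitness(s_next) - fitness(s)
    return delta if maximize else -delta
```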
Constraints are used in MDO to validate models resulting from MTs in terms of them representing a solution to the underlying problem. Invalid solutions may be filtered out completely or may receive a low ranking dependent on the magnitude of the constraint violation. In our example, such a constraint could be a time condition for retrieval of the containers, which may be declared through a threshold on the total travel distance. Several treatments are feasible, such as penalizing MTs leading to invalid states or marking them as invalid and prohibiting their reuse. Nevertheless, in some cases, invalid solutions can represent indispensable intermediary steps towards a valid solution. Prohibiting rule applications that lead to such states then renders parts of the solution space unattainable. In this respect, appropriate treatment upon encountering constraint-violating models depends on the problem representation and the use case at hand.
3.2 Application for the running example
For the CRP, the initial problem state \(s_0\) captures the container allocation in the container bay. Executable transformations are defined with the transformation unit RetrieveElseRelocate (\(a \in A\)) and parameters (\(A_P(a)\)) that specify the transformation process, which here corresponds to identifiers of the model elements container, from, and to (cf. Fig. 3b). Accordingly, S encompasses \(s_0\) and all variations of it attainable through the application of this unit to move or retrieve containers.
To assess the retrieval progress, we can use the objectives described in Section 2.1.3. For one, we aim to maximize the number of retrieved container instances. Consequently, upon retrieval, the agent receives a reward \(r_{1} = 1\), and for moves, \(r_{1} = 0\). Likewise, we could opt to reward each execution of the Retrieve subunit, which will be contained in the invocation tree of the applied instance of RetrieveElseRelocate. To estimate efficiency, the second objective aims at minimizing the total travel distance of all moves. Therefore, each move yields a (negative) reward \(r_{2}\) corresponding to the distance between the source and target stack. Further, \(r_{2} = 0\) for each retrieve operation. Note that a discount \(0< \gamma < 1\) encourages the agent to maximize the rewards from retrieving containers and traveling distances with as few moves as possible.
As for the termination condition, we may use a desired target state, namely that ultimately no container instances remain in s. Alternatively, if we know the required number of transformation steps for a desirable solution, the training episodes could be limited to a fixed number of rule applications. This also facilitates the comparison with search strategies limited to a fixed-length representation for solution candidates.
3.3 Extending the MDO framework architecture
We reformulated the problem definition and associated artifacts for MDO approaches as a sequential decision problem. To this end, techniques from the RL field become applicable [31]. Now we describe the requisites to extend the scope of the MDO framework accordingly.
Fig. 7 shows the intended additions for the MDO framework architecture in support of an RL ecosystem. RL techniques provide an alternative optimization strategy, where an agent is instantiated for orchestrating transformations to explore the input model’s design space. Learning algorithms may be implemented as tabular methods or function approximation methods, the latter potentially being initialized with a previously learned policy. Apart from the output policy, statistics capture the learning progress, e.g., rule selection statistics and convergence metrics.
The search engine incorporates the same utilities as for the use of search-based algorithms (cf. Fig. 2). However, it also provides the interface for agent-environment interaction for the use of RL techniques. This includes the emission of the current model state, the application of a received rule, and the generation of the reward after the objective and constraint evaluation. The evaluation also builds on the existing functionalities regarding the intended encoding as rule sequences, execution of the transformation rules, calculation of the fitness, etc. The problem configuration remains as input with additional configurations relating to the initialization of the learning algorithms.
From the user’s point of view, the seamless integration enables RL techniques to be used alongside search algorithms without further ado, apart from algorithm-specific parameters. Thus, the same optimization artifacts are produced as after the conventional search process. Nevertheless, for function approximation methods there must be the option for re-initialization with a model trained in the past (e.g., the parameters of a neural network). Regarding reward calculation, the user may supply a function where the calculation is based on the transformed model and/or the executed rule instead of fitness functions.
4 Adapting a catalog of learning techniques for MDO
At the time of writing, there are many RL algorithms [5, 6] that fall into a range of different categories [36, 37]. The value-based approach (Section 2.2.2) learns a mapping of model states and MTs to corresponding Q-values. In contrast, policy-based approaches [31] learn a policy represented as a parametric function using function approximation methods. To this effect, the values in Q(s, a) can be approximated using a tabular approach or as a parametric function. For the latter, neural networks are commonly employed, which require an encoding to project the model state and rule applications to numerical representations for processing. Since we first and foremost aim for broad applicability, we focus on tabular methods in this paper. A key difference among value-based algorithms is how the rewards are processed to exploit transformations later on.
We first provide an overview of selected approaches and their properties. Then, we elaborate on the Q-learning algorithm we adopt as our baseline for supporting learning-based MDO. Moreover, it provides the foundation for several descendent algorithms capable of handling multi-objective problems, three of which we consider for their application to MDO. Ultimately, taking a step towards hybrid approaches, we propose a universally applicable strategy to improve the performance of the presented algorithms in challenging exploration problems.
4.1 Overview
We choose four algorithms for differently nuanced settings to leverage explored MTs; we describe them in the following and have implemented them to be available in MOMoT. First, we consider Q-learning [38] as one of the most renowned RL algorithms. Second, we look into scalarization approaches as we adopt a weighted sum solution based on the Chebyshev metric [39]. Third, we adopt a variant of the W-learning approach [40] that strives to improve all objective dimensions by seeking the most affected objective in the exploitation phase. The fourth algorithm, Pareto Q-learning (PQL) by Moffaert et al. [41], deviates substantially as it refrains from approximating action-value functions, and instead learns the immediate reward and future rewards separately.
Table 1
Classification of the selected RL approaches (\(\checkmark \)) Fulfilled (–) Not fulfilled
Table 1 provides an overview of properties that can guide the selection of an appropriate algorithm for a given application scenario. In this regard, the classical Q-learning definition assumes a single-valued reward, and hence, does not support multi-objective use cases (1) whereas others are aimed specifically at such problems. Nevertheless, Q-learning has been established to converge to the optimal solution under certain conditions [38]. PQL learns multiple policies (2) for optimization in each objective dimension in a single execution whereas \(Q_{Chebyshev}\) and \(Q_{W}\) merely take into account the objectives to converge to a policy that leads to a single solution. Further, user preferences (3) can be expressed to emphasize the optimization intent for \(Q_{Chebyshev}\) and PQL. Of the listed algorithms only PQL is unaffected by objective values on different scales (4), whereas in the others those of higher magnitude will be dominant if they are not normalized. The memory consumption (5) depends on the number of encountered states and actions. Depending on this, a scalar or, in the case of multiple objectives, a vector is mapped to a state-action pair. With PQL, several vectors are stored depending on the existing Pareto-optimal solutions.
For our prototypical implementations, we commit to the tabular approach, which maintains the value estimates for models and rule applications in a table-like data structure. This implies a memory limitation as increasingly more models and rule applications are observed, which is to be taken under consideration for the problem at hand. All the algorithms presented below are to some extent based on the same principles as Q-learning [38], which has desirable properties such as guaranteed convergence. They all belong to the methods of temporal difference learning [31].
4.2 Q-learning as a baseline
Q-learning is an off-policy control algorithm based on temporal difference learning [31]. Accordingly, updates are based on the bootstrapping rule in Equation 6, which enables an approximation of the action-value function Q(s, a) through step-wise updates controlled by a learning rate \(\alpha \) and based on future reward estimates.
Note that the update rule in Equation 6 ignores the actually selected action in \(s_{t+1}\). Concretely, it presumes the highest-reward action to be picked in the next state with \(\max _a Q(s_{t+1},a)\). However, the \(\epsilon \)-greedy approach may have the agent opt for a random action instead. In that case, the future reward estimate diverges from reality, which is indicative of off-policy approaches. This is in contrast to “on-policy” algorithms like SARSA [42], where the update depends on the action executed in the following state. Off-policy algorithms are considered more sample-efficient and faster in converging to the optimal policy but at the cost of learning stability. Given a sufficient number of updates, Q(s, a) converges to the optimal action-value function \(Q^*(s,a)\) [38].
Algorithm 1 gives a rundown of Q-learning adapted for in-place MTs. Parameters for Q-learning are the discount rate \(\gamma \) and exploration probability \(\epsilon \). The environment is initialized with the initial model \(s_0\), the reward function, and an interface to consult the transformation engine. The reward function R(s, a) provides the objective fitness value, which may result from evaluation of the result model \(s'\) or the executed rules, after applying rule a on a current model s. Initially, we start with Q(s, a) as an empty map. The learning process then runs through a fixed number of transformation steps before termination (line 5). Starting from \(s_0\), transformations are executed by sequentially picking rule applications following an epsilon-greedy strategy (lines 8-9). After executing each rule application, we receive the transformed model \(s'\), observe the reward r from evaluating the objective fitness values of the model, check if the termination criterion T is satisfied, and finally update Q(s, a) (lines 10-11). For our purpose, T will be set to limit the transformation length. The subsequent iteration proceeds with the transformed model \(s'\).
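A tabular sketch of this loop might look as follows. The callbacks `applicable`, `apply_rule`, and `reward` are illustrative placeholders for the transformation engine and fitness evaluation, not MOMoT's actual interface, and the hyperparameter defaults are arbitrary.

```python
import random
from collections import defaultdict

def q_learning(s0, applicable, apply_rule, reward, episodes=200,
               max_len=10, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Sketch of tabular Q-learning for in-place MTs (cf. Algorithm 1).
    `applicable(s)` lists rule applications, `apply_rule(s, a)` returns the
    transformed model (deterministic), `reward(s, a, s2)` the scalar reward."""
    rng = random.Random(seed)
    Q = defaultdict(float)                      # Q(s, a), initially empty
    for _ in range(episodes):
        s = s0                                  # restart from the initial model
        for _ in range(max_len):                # termination criterion T
            acts = applicable(s)
            if not acts:
                break
            if rng.random() < eps:              # epsilon-greedy selection
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda a: Q[(s, a)])
            s2 = apply_rule(s, a)               # deterministic transformation
            r = reward(s, a, s2)
            nxt = max((Q[(s2, a2)] for a2 in applicable(s2)), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * nxt - Q[(s, a)])  # Equation 6
            s = s2
    return Q
```

Reading off the greedy action per state from the returned table corresponds to the learned policy.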
The approaches described hereafter build on fundamental concepts of Q-learning and extend them to multi-objective problems, in general as well as for MDO formulations, by dealing with multi-dimensional rewards.
4.3.1 Scalarization approach
A prevalent approach to multi-objective optimization is to derive a combined objective reflecting the user preferences [43]. The preferences are expressed as a weight vector to scale the observed objective values, which are then aggregated with a scalarization function. The resulting value can then be treated as a single objective. In the simplest form, the function is a linear combination \(\sum _{i=1}^{m} w_i \cdot f_i(x)\).
Similarly, for multi-objective RL, a weight vector can be specified to scale rewards in \(\varvec{r}(s,a)\) as they are linked to more or less important objectives [6]. Combining rewards from several utility functions for estimating action values was first studied in [44]. In the present approach, a separate action value function \(Q_i(s, a)\) is approximated for each objective i using the values observed from the reward vector. Meanwhile, the scalarized value is computed with function SQ(s, a), which is used for the action selection and updates of \(Q_i(s,a)\):
$$\begin{aligned} SQ(s,a) = \sum _{i=1}^{m} w_i * Q_i(s, a) \end{aligned}$$
(7)
Alternative to the linear function, the Chebyshev metric [45] (see Appendix A.1) can be used to calculate SQ(s, a). In addition to weight scaling, a current estimate of the utopian point is factored into selection, which has shown to yield a performance increase [39].
Aggregation to a single value is generally simple and computationally efficient, making it also appropriate for evaluating the fitness of models in the context of transformation problems. For each fitness function used to evaluate the model, a value function \(Q_i(s,a)\) is initialized and updated. During selection, the values \(Q_i(s,a)\) for a given transformation in a given model state are summed after each is scaled by the corresponding weight. However, the scalarization approach might not be appropriate if the weight specification is difficult or objectives cannot be reasonably projected to a single metric, especially if the fitness values have different orders of magnitude.
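Both scalarization variants can be sketched compactly. Equation 7 is the linear form; the Chebyshev form follows our reading of Appendix A.1 as a weighted distance to a utopian reference point (the exact bookkeeping of that point follows [39] and is omitted here), so lower values are better and the selection minimizes.

```python
def linear_sq(q_values, weights):
    """Equation 7: weighted sum of the per-objective action values Q_i(s, a)."""
    return sum(w * q for w, q in zip(weights, q_values))

def chebyshev_sq(q_values, weights, utopia):
    """Chebyshev scalarization (cf. Appendix A.1): weighted distance of the
    action values to a utopian point z*; lower is better."""
    return max(w * abs(q - z) for w, q, z in zip(weights, q_values, utopia))

def select_action(actions, q_of, weights, utopia=None):
    """Pick the rule application with the best scalarized value.
    `q_of(a)` returns the vector (Q_1(s,a), ..., Q_m(s,a))."""
    if utopia is None:
        return max(actions, key=lambda a: linear_sq(q_of(a), weights))
    return min(actions, key=lambda a: chebyshev_sq(q_of(a), weights, utopia))
```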
4.3.2 Negotiated W-learning
This algorithm adheres to a compromise that guarantees the optimal selection of an action for at least one of the objectives [40]. Similar to the weighted sum approach, each objective is associated with an agent approximating a dedicated action-value function. Instead of aggregating the criteria aiming for a reasonable trade-off, W-learning selects one objective to prioritize at each step. Accordingly, the expected outcomes for an action in the current state are compared between agents, i.e., from the perspective of each criterion. This comparison can be seen as engaging in a negotiation for the next executed transformation (see Appendix A.2 for more details on the negotiation procedure).
For an intuitive illustration, consider our running example, dedicating an agent to the retrieved containers (\(A_{RC}\)) and the travel distance (\(A_{TD}\)), respectively. An initial proposal from \(A_{RC}\) for a transformation \(t_a\) could move a container to enable retrieval of underlying containers. However, to the detriment of \(A_{TD}\), the relocation involves a long traveling distance (expressed by a weight \(W_{TD}\)). Hence \(A_{TD}\) can suggest a transformation \(t_b\) for an alternate move operation with a shorter distance. Yet, with \(t_b\) further retrieval is hampered (as is quantified by \(W_{RC}\)), which exceeds the disadvantage of \(t_a\) for the assessment of \(A_{TD}\), i.e., \(W_{RC} > W_{TD}\), again enabling \(A_{RC}\) to intervene for a new counter-proposal. The finally executed transformation is ultimately decided by the agent to whom the counterpart cannot assert a greater disadvantage for its objective for not following its suggestion.
For practical considerations, W-learning constitutes an unbiased approach concerning explicit objective prioritization, and the memory requirements are comparably marginal. In this respect, there is no preference setting, but a trade-off of all optimization criteria for the learned policy is inherent to the continual change of decision-making power. Notably, an action-value function \(Q_i(s,a)\) is still learned for each criterion although a single policy is learned as all agents collectively decide which transformation they all execute. Despite the weights being continuously calculated in the negotiation procedure described, they do not have to be learned or stored but are obtained ad hoc from \(Q_i(s,a)\).
4.3.3 Pareto Q-learning
Whereas the scalarization approach and W-learning attune to a trade-off in criteria for the learned policy, either by weighting objectives or implicitly, PQL [41] accommodates objective vectors instead of action-value functions. With the latter, the update rule for Q(s, a) in Equation 6 is based on the immediate reward and expected future reward. Here, however, for each state-action pair an average vector \(\overline{{\mathcal {R}}}(s,a)\) is maintained whereas the future reward corresponds to a (Pareto) vector set \(ND_t(s,a)\). After each step, the sole agent adds the observed immediate reward \(\varvec{r}\) to \(\overline{{\mathcal {R}}}(s,a)\) and updates \(ND_t(s,a)\) with the vectors resulting from adding the immediate reward and the expected future rewards from the successor state, \(\hat{Q}_{set}(s',a')\). For details on the update rules as well as the usable action selection mechanisms, please refer to Appendix A.3.
Reward decomposition in PQL allows for the preservation of individual goal contributions and therefore leads to a variety of solutions for different preference scenarios. Since only the non-dominated set is retained during updates, any action that is later reconsidered is part of a non-dominated strategy. In MDO terms, this ensures that each time the agent reuses a transformation, it is one of a transformation sequence that produces a model that is optimal for at least one of the criteria. This potentially allows learning several optimal policies along the entire Pareto front. Nevertheless, it does not guarantee optimal or even better solutions for MDO problems with large design spaces due to limited exploration. Implementation-wise, the algorithm requires more memory to store vector sets instead of scalar values.
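The core set operations of PQL can be illustrated as follows. This is a sketch of the dominance filtering and the set-based estimate only, assuming all objectives are maximized; the full update rules and action selection mechanisms follow [41] and Appendix A.3.

```python
def dominates(u, v):
    """u Pareto-dominates v (assuming all objectives are maximized)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def nd_filter(vectors):
    """Keep only non-dominated reward vectors, as PQL does for ND_t(s, a)."""
    return [u for u in vectors
            if not any(dominates(v, u) for v in vectors if v != u)]

def q_set(avg_r, nd_future, gamma=0.9):
    """Set-based estimate: add the average immediate reward to each
    discounted non-dominated future vector."""
    vecs = [tuple(r + gamma * f for r, f in zip(avg_r, fut)) for fut in nd_future]
    return nd_filter(vecs) if vecs else [tuple(avg_r)]
```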
4.4 Enhancing exploration through local search
The effectiveness of RL comes from learning state values that include the rewards that can be expected in subsequent states. The basis for this is sufficient exploration to learn the values of as many states as possible. The exploration strategy regulated by epsilon (\(\epsilon \)) by default executes a random step, which is decisive for the remaining steps in the learning episode. This can hamper the learning progress substantially, especially in hard exploration problems with huge state spaces and long sequences. Referring to our running example, moving a container to access a particular container might temporarily place the moved container in a position that obstructs other containers that will need to be accessed later. As a result, more relocations will be required in total, making the retrieval process less efficient. Nevertheless, in some problem domains, there is no alternative to random selection as the effects on the state can only be observed after executing the action.
If states and outcomes of actions can be simulated for a domain, however, this can be utilized for training, as demonstrated by some work on robots for RL tasks [46‐49]. To this effect, a simulator can be used to restore a state, e.g., to learn how to best deal with problematic conditions before deployment. Such simulations are also feasible for learning MDO as we can perform different MTs and evaluate all resulting models before proceeding from one of them.
To improve the exploration capabilities of the adopted RL algorithms, we propose an exploration strategy in the form of a local search. Consequently, instead of committing to a random rule application (Algorithm 1, line 8) issued by the employed transformation engine, we use the engine to consider several feasible rule applications for the current model. The procedure of this strategy for the multi-objective case is outlined in Algorithm 2. As a preliminary setting, N defines the predefined number of simulations to perform for exploring from a current state s. In line 4, we elicit a rule application a and the corresponding reward \(\overrightarrow{R}\) from the environment simulator. The scalarization step (line 5) computes a scalar reward SR(s, a) and appends it to a list with the results from other simulation turns. The scalarization function can be expressed as the reward sum, or be tailored to the algorithm used, e.g., a weighted sum approach (cf. Section 4.3.1). Taking a greedy approach, in line 8, we return the action maximizing SR(s, a), i.e., \(\arg \max _{a} SR(s,a)\) over the entries of \(SR\_List\). For the single-objective case the procedure is identical, with the multi-valued reward \(\overrightarrow{R}\) becoming a scalar value R that goes directly into the list. In any case, the described local search replaces the random step in the \(\epsilon \)-greedy action selection strategy when used.
The greedy local search can significantly accelerate the movement towards fruitful regions in the search space since much better states are identified from the beginning but it also threatens the convergence guarantee. After all, finding an optimal policy requires a sufficient sampling of all states [38]. Even theoretically, this may no longer be the case, namely if the greedy strategy in state s consistently prioritizes an action over the one that is part of the optimal policy. That said, sufficient sampling is infeasible in correspondingly large or even infinite search spaces to begin with and the aim is to achieve the best possible approximate solutions, as is also often the case with MT problems.
5 Evaluation
In this section, we evaluate our extension of the MOMoT framework [3] for RL. Our evaluation methodology adheres to the guidelines for conducting empirical case studies presented in [50]. We first describe the research questions (RQs) that motivate our effort and that we aim to answer, followed by the setup of our case-study-based evaluation. It includes a description of each case we consider, the evaluation metrics used, the parameter settings for the algorithms, and the testing infrastructure. We then list the results and proceed with a discussion of the findings. The final part of this section is concerned with a discussion of validity threats that may affect our results.
5.1 Research objective
We aim to evaluate to what extent RL techniques constitute a reasonable alternative to established meta-heuristic search approaches in finding solutions for transformation problems. The integration of additional algorithms is intended to uphold MDO’s claim of generic applicability while supporting the same application process from the user’s perspective through seamless integration with existing technologies. We can ensure this by conducting several case studies where we use the algorithms presented in Section 4 in experimental setups next to an established search algorithm. Then, the performance in terms of optimization capability and the time to find solutions is of interest. We address and explicate this research interest with the following two RQs:
RQ1–Applicability Are RL methods suitable for solving in-place MTs across diverse scenarios? What methods are applicable and what are the limitations to optimize transformations in single-objective and multi-objective settings?
RQ2–Performance What are the advancements for MTs brought forth by RL methods in terms of objective optimization? In particular, we are interested in comparing RL with search-based approaches to understand their (dis-)advantages for achieving appropriate objective realizations.
To answer the RQs, we base our conclusions on several measures (Section 5.2.2) applied to the results we obtained in our case studies.
5.2 Study setup
This subsection specifies the experimental setup before presenting results in the subsequent subsection. We describe the cases studied, the parameter settings, the evaluation methods, and the hardware used to run the experiments.
5.2.1 Case studies
Our selection includes several cases frequently used for the evaluation of MDO approaches, for whose representation the respective domain models are available, and objective formulations for evaluating a model’s fitness have already been devised. These include refactoring tasks, design model optimizations, as well as a common workflow from the logistics domain. An overview is provided in Table 2. The artifacts used in each case study are available in our public project [10].
Each of the case studies comprises multiple instances of varying complexity, the results of which we consider collectively in order to gain insights with regard to the research objective. The problem domains are described in terms of the model abstractions, the transformation rules used to define the search space, and the objective functions for measuring the degree of optimality.
Please note that the selected case studies are all endogenous, i.e., transformations within the same modeling language, and horizontal, i.e., staying on the same abstraction level [51]. While other types of transformations may be also formulated as MDO problems in MOMoT, we leave the evaluation of RL for such transformations (exogenous and vertical ones) as subject to future work.
Table 2 (excerpt). Example transformation rules: Create root class (3), Extract super class (3), Pullup attribute (2); example objective: #Entities + #Properties (\(\downarrow \)). The arrow next to the objective name indicates whether it is to be minimized or maximized. Objectives in bold text are used for the single-objective formulations.
a The unit RetrieveElseRelocate consists of the condition rule canRetrieveContainer and two sub-units, Retrieve and Relocate, which together involve 19 other units and rules.
b The CRA-Index is exclusively used for the single-objective formulation as it combines cohesion and coupling in one single objective.
Container Relocation Problem (CRP) We refer to Section 2.1.3 for a general description. As objective functions, in the multi-objective evaluation, we use (1) the Container-Index and (2) the travel distances between stacks over all moves. In the single-objective evaluation, we omit the travel distance and only consider the Container-Index as a combined measure factoring in the retrieved containers and a metric for the amount of blocked containers, called blocking factor. The global blocking factor (\(BF_{G} \in [0, 1]\)) quantifies the containers having one or more containers of lower priority stacked above them. For a container \(c_i\) (with i denoting the container index) we define the blocking degree (BD) in Equation 9 as the number of containers above it having a larger index on the same stack \(s_i\) (with i denoting the stack index). The BD is normalized using the triangular number \(T_n = \frac{n*(n+1)}{2}\), which gives the number of possible blockages on a stack with \(n+1\) containers. Factors are summed up over the stacks and divided by the number of overall containers |C|. Finally, the retrieved units and \(BF_{G}\) are combined into a single objective in Equation 10.
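One literal reading of the blocking-factor description can be sketched as follows; since Equations 9-10 are not reproduced here, the exact normalization is an assumption. Stacks list containers bottom-to-top, and a smaller index means earlier retrieval.

```python
def global_blocking_factor(stacks):
    """Sketch of BF_G following the textual description: per-stack blocking
    degrees normalized by the triangular number T_n, summed over the stacks,
    and divided by the overall number of containers |C|."""
    total_containers = sum(len(s) for s in stacks)
    bf = 0.0
    for stack in stacks:
        n = len(stack) - 1
        if n <= 0:                              # single container: no blockage
            continue
        t_n = n * (n + 1) // 2                  # possible blockages on the stack
        bd_sum = sum(
            sum(1 for above in stack[i + 1:] if above > c)  # BD of container c
            for i, c in enumerate(stack)
        )
        bf += bd_sum / t_n
    return bf / total_containers
```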
The four model instances range between three stacks and eight containers to ten stacks and forty containers.
Stack load balancing (SLB) Already used in the context of MDO in [3, 52], this case study concerns the distribution of items in a system of stacks. The StackModel (cf. Fig. 8) consists of multiple stacks (Stack) that are connected in a circular way with each stack having two neighbors, left and right, and a load assigned.
Using two rules, shiftRight and shiftLeft (defined analogously), stacks can shift load to their neighbors. Feasibility of the shift operation depends on available units on the providing stack and is guaranteed by a precondition. Changes are defined by rule parameters specifying the sending and receiving stack, and the amount of shifted items.
The transformation goal is to shift load accordingly to balance the load distribution. That is, to minimize the standard deviation over the loads of all stacks.
With each additional stack, the number of shifting possibilities and hence the state space grows exponentially, which complicates reaching an equal distribution. We use problem instances with 5, 10, 25, or 50 stacks.
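The shift rule and the balancing objective can be sketched as below. The function names are illustrative; the precondition mirrors the feasibility check described above, and we use the population standard deviation as one concrete reading of the objective.

```python
import statistics

def shift_right(loads, src, amount):
    """Sketch of the shiftRight rule on a circular stack system: move
    `amount` load units from stack `src` to its right neighbor."""
    if loads[src] < amount:                     # precondition: enough units
        raise ValueError("insufficient load on providing stack")
    result = list(loads)
    result[src] -= amount
    result[(src + 1) % len(loads)] += amount    # circular neighborhood
    return result

def imbalance(loads):
    """Objective to minimize: standard deviation over all stack loads."""
    return statistics.pstdev(loads)
```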
Class Responsibility Assignment (CRA) The CRA problem stems from the quality assurance domain of object-oriented systems [53] and has been extensively used in the Transformation Tool Contest as a particular transformation case [54]. Given a set of methods and attributes, it aims for an assignment to classes that well separates their concerns. This is of interest to raise the quality of software models and consequently the generated code. As defined in Fig. 9, the ClassModel consists of classes (Class) which hold a set of features (Feature). A feature is either a Method with functional (functionalDependency) or data dependencies (dataDependency), or an Attribute.
The refinement of a design is covered by a single rule, reassignFeature, which moves a method or attribute from one class to another. This requires assigning each feature to a separate class in the initial problem instance, which therefore contains as many classes as there are features. After the algorithm has terminated, residual empty classes without any assignments are removed.
Models are evaluated using CohesionRatio and CouplingRatio as common metrics to measure separation of concerns [54]. The former indicates the concentration of dependencies within the same classes and should be maximized. The latter counts dependencies between features of different classes and should be minimized to indicate an extensive separation of independent system aspects. The formulations used consider the method-attribute and method-method combinations formable within and between classes. Specifically, for two classes \(c_i\) and \(c_j\), the total sum of dependencies across all feature combinations is counted. For CohesionRatio, combinations are considered for class pairs where \(c_i = c_j\), i.e., feature combinations within the same class, whose dependencies should be maximized. For CouplingRatio, combinations are considered for class pairs where \(c_i \ne c_j\), i.e., dependencies between features of different classes, which should be minimized. The same schema is applied for counting dependencies between methods. For the single-objective variant, the two metrics are combined into a single measure, the CRA-Index, defined as their difference CohesionRatio - CouplingRatio [54].
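A deliberately simplified sketch of this combined measure follows; it only counts raw within-class versus cross-class dependencies, whereas the actual metric from [54] additionally normalizes by the number of formable method-attribute and method-method pairs per class pair:

```python
def cra_index_sketch(classes, dependencies):
    # simplified CRA-Index = cohesion - coupling: a dependency between
    # two features counts towards cohesion if both live in the same
    # class, and towards coupling if they live in different classes
    location = {f: c for c, feats in classes.items() for f in feats}
    cohesion = sum(1 for a, b in dependencies if location[a] == location[b])
    coupling = sum(1 for a, b in dependencies if location[a] != location[b])
    return cohesion - coupling
```

Reassigning a feature so that more of its dependencies stay inside one class raises this simplified index, which is the direction the rule-based search optimizes towards.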
In our study, we use five design models from the Transformation Tool Contest in 2016. As pointed out in [54], the number of possible arrangements such that every feature is assigned to a class is given by the Bell number \(B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k\). The n-th number in this sequence, \(B_n\), denotes the number of possibilities to distribute n elements into disjoint subsets. For the smallest instance considered, this means 21147 possible result models for 9 features. The next instance, with 18 features, already exhibits 682076806159 possibilities.
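The growth of this search space follows directly from the Bell recurrence and can be checked numerically:

```python
from math import comb

def bell(n):
    # Bell numbers via B_{m+1} = sum_{k=0}^{m} C(m, k) * B_k with B_0 = 1;
    # B_n counts the partitions of n elements into disjoint subsets
    b = [1]
    for m in range(n):
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]
```

This reproduces the instance sizes quoted above: `bell(9)` is 21147 and `bell(18)` is 682076806159.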
Refactoring This case study considers the class diagram restructuring case from the Transformation Tool Contest [55] and thus aims at improving an object-oriented design model by abstracting classes that share common data features to achieve a less redundant structure. It involves identifying duplicate attributes in classes that could alternatively be accessed from a common object on a higher level, and deriving new classes that share features of related entities. As illustrated in Fig. 10, a model consists of entities (Entity) and properties (Property), with inheritance relationships depicted as Generalization. That is, an entity A inherits properties of another entity B if A has a generalization relationship to a Generalization G and B is the target entity of G (general). In addition, properties have a type (Type).
The refinement transformations are defined with three rules: First, createRootClass introduces a new entity C with a property a previously declared in two subclasses that henceforth inherit this property from C. Second, the rule extractSuperClass adds an intermediate level to the inheritance hierarchy. More precisely, a new entity C becomes the superclass of classes that share a common property a and previously derived from a class S. Entity C then declares a and inherits from S, exposing a to entities that individually declared a before this change. Third, for classes sharing a common superclass S and declaring an attribute a, pullUpAttribute removes a from these classes and re-declares it in S.
The aim is to achieve the lowest possible redundancy for elements used in the design model. To this end, the objective function minimizes the total number of entities and properties.
We consider two different problem instances. For the model denoted as Ref-1, attributes need to be moved over multiple inheritance layers to the top of the hierarchy; for Ref-2, several of the mentioned activities need to be performed to optimize the model.
5.2.2 Evaluation metrics
For the applicability question (RQ1), we focus on potential obstacles one may encounter when using RL for in-place MTs, as well as necessary preliminary steps and limitations in the application. More practically, we show the feasibility by instrumenting implementations of the algorithms described in Section 4 alongside the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [56]. The capability of NSGA-II has been demonstrated extensively in the MOMoT environment [3], which is the same environment we leverage in this work.
Table 3
Mean (\(\mu\)) HV and execution time, standard deviation (\(\sigma\)), and BOV observed over 30 runs for single-objective RL algorithms and NSGA-II (with MOMoT encoding)

| Case | Alg. | HV \(\mu\) | HV \(\sigma\) | BOV | Time \(\mu\) |
|---|---|---|---|---|---|
| SLB-5 | NSGA-II | .274 | .032 | .000 | 1.33s |
| | \(Q\) | .290 | .015 | .000 | 5.48s |
| | \(Q^+\) | .310 | .009 | .000 | 2.85s |
| SLB-10 | NSGA-II | .655 | .035 | .748 | 5.28s |
| | \(Q\) | .600 | .043 | .980 | 15.95s |
| | \(Q^+\) | .707 | .012 | .400 | 9.00s |
| SLB-25 | NSGA-II | .473 | .036 | 5.762 | 40.71s |
| | \(Q\) | .279 | .021 | 10.327 | 71.64s |
| | \(Q^+\) | .567 | .011 | 3.688 | 39.67s |
| SLB-50 | NSGA-II | .304 | .027 | 19.235 | 208.70s |
| | \(Q\) | .162 | .015 | 24.809 | 307.17s |
| | \(Q^+\) | .532 | .010 | 15.297 | 171.52s |
| CRP-3S8 | NSGA-II | .386 | .008 | 8 | 12.68s |
| | \(Q\) | .394 | .000 | 8 | 64.28s |
| | \(Q^+\) | .383 | .021 | 8 | 43.92s |
| CRP-5S20 | NSGA-II | .026 | .051 | 20 | 175.99s |
| | \(Q\) | .359 | .046 | 20 | 286.27s |
| | \(Q^+\) | .266 | .003 | 5.961 | 122.00s |
| CRP-10S40 | NSGA-II | .258 | .024 | 16.918 | 464.66s |
| | \(Q\) | .434 | .034 | 14.937 | 722.41s |
| | \(Q^+\) | .344 | .087 | 8.936 | 317.88s |
| CRA-9 | NSGA-II | .192 | .140 | 3 | 4.35s |
| | \(Q\) | .391 | .015 | 3 | 9.89s |
| | \(Q^+\) | .403 | .004 | 3 | 5.45s |
| CRA-18 | NSGA-II | .378 | .076 | 2.700 | 19.94s |
| | \(Q\) | .485 | .031 | .000 | 41.64s |
| | \(Q^+\) | .629 | .015 | 4 | 23.61s |
| CRA-35 | NSGA-II | .559 | .040 | -6.286 | 95.14s |
| | \(Q\) | .481 | .023 | -18.250 | 214.98s |
| | \(Q^+\) | .751 | .008 | .100 | 121.56s |
| CRA-80 | NSGA-II | .566 | .035 | -32.651 | 703.33s |
| | \(Q\) | .363 | .013 | -86.041 | 1560.07s |
| | \(Q^+\) | .767 | .008 | -14.621 | 913.83s |
| Ref-1 | NSGA-II | .449 | .009 | 41 | 19.49s |
| | \(Q\) | .458 | .006 | 41 | 362.17s |
| | \(Q^+\) | .462 | .000 | 41 | 175.98s |
| Ref-2 | NSGA-II | .457 | .024 | 26 | 16.06s |
| | \(Q\) | .477 | .013 | 26 | 66.31s |
| | \(Q^+\) | .484 | .003 | 26 | 60.38s |
BOV corresponds to the standard deviation (SLB), CRA-Index (CRA), retrieved containers (CRP), and content size (Refactoring - "Ref"). Bold numbers mark the best mean and BOV for each instance. The “+” after the algorithm name indicates that a local search was employed in the exploration phase (Alg. 2)
For the performance evaluation (RQ2), we consider the models that result from the transformation sequences produced by the algorithms, henceforth referred to as solution or approximation sets. Since each case has at least one case-specific objective and we also consider the length of the MT sequences, every solution has an objective vector in n-dimensional space. Specifically, because the transformation length is an additional general objective, \(n=2\) for cases with a single quality objective (here referred to as single-objective) and \(n > 2\) for cases with multiple quality objectives (here referred to as multi-objective).
Considering these different settings, we use the HV indicator for a direct comparison of solution sets, as recommended by the authors of [23] when dealing with fewer than 10 objectives. As the reference set, we use the combined non-dominated front, which is deemed common practice [23]. Furthermore, where there is only one quality objective in place, we are interested in the total optimization capability and hence determine the best objective fitness value (BOV) after each algorithm run. The BOV corresponds to the best value observable in the produced solution set for the objective of interest, which is designated in bold for each case in Table 2. Finally, we measure the execution times of the individual algorithms to obtain an estimate of the overhead of the RL techniques. This contrasts with GAs, which offer diverse potential for parallelizing computations.
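For the two-dimensional setting (one quality objective plus transformation length), the HV computation reduces to a simple area sweep. A sketch for minimization objectives, assuming a fixed reference point (the reference-set handling used in the experiments is omitted):

```python
def hypervolume_2d(points, ref):
    # area dominated by the non-dominated subset of `points` and bounded
    # by the reference point `ref`; both objectives are minimized
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # skip points dominated during the sweep
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

A larger HV means the solution set covers more of the objective space below the reference point, which is how the tables below rank the algorithms.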
Due to the stochastic elements present in both meta-heuristic algorithms and our RL implementations, resulting solutions vary across multiple executions. To ascertain the significance of performance differences, we collect results over 30 independent experiment runs for each instance of each case study and conduct pairwise tests for all case studies, evaluating the effect size and testing for significance of the differences in performance. Regarding effect size, we analyze Vargha and Delaney’s \(\hat{A}_{12}\) statistic [57], which is suitable for non-normally distributed data samples [58], as we find them in our experiments. To this end, we calculate the pooled statistic on a case-study basis, averaging over the results for the instances of each case study. The statistic indicates a small (.5-.56), medium (.57-.64), large (.65-.71), or very large (.72-1) difference in favor of the algorithm in the column. In addition, statistical significance is determined applying the Mann-Whitney U test [59] as a non-parametric test, using a significance level \(\alpha = 0.05\). The null hypothesis states performance equality whereas the alternative hypothesis states that there is a significant difference in performance.
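The \(\hat{A}_{12}\) statistic itself is straightforward to compute; a sketch follows (the pooled, per-case averaging described above is omitted):

```python
def vargha_delaney_a12(xs, ys):
    # probability that a value drawn from xs is larger than one drawn
    # from ys, counting ties as one half; 0.5 means no difference
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

With two samples of 30 HV values each, a result of, say, 0.95 would fall into the "very large difference" band quoted above.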
5.2.3 Parameter configuration
The algorithms require an appropriate configuration before they can be executed for each case study.
Each experiment is run 30 times, where each algorithm terminates after a predefined number of function evaluations (FEs): Up to 40,000 for SLB, 80,000 for the CRA problem, 20,000 for the CRP, and 1,000 for the Refactoring case. As performance indicators, we use the mean HV, runtime, and the best overall solutions per algorithm.
We use the NSGA-II that is implemented in MOMoT as a baseline for comparison against the learning-based algorithms, and for configuration adhere to the settings reported in previous case studies [3]: population size of 100; deterministic tournament selection (\(n=2\)); one-point crossover (\(p=1.0\)); two mutation operators to remove (\(p=0.1\)) and replace (\(p=0.05\)) a rule in the transformation sequence.
For RL algorithms we use a discount \(\gamma \)=0.9 to optimize towards shorter transformation sequences. We set \(\epsilon \)=1.0 for an exploration-heavy phase at the start and gradually decrease \(\epsilon \) to shift towards exploitation of transformations with increasing FEs until \(\epsilon \)=0.05. Our studies test the approaches from Section 4, in particular, Q-learning (\(Q\)), the Chebyshev scalarization function (\(Q_{Chebyshev}\)), Negotiated W-learning (\(Q_{W}\)), and PQL using the HV indicator (\(PQ\)-\(HV\)) and Pareto set evaluation (\(PQ\)-\(PO\)) mechanism, respectively. We enable the enhanced exploration strategy (Section 4.4) for instances of large combinatorial problems, therefore marking those instances with a superscript plus (“+”) following the algorithm name.
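The exploration schedule can be sketched as follows; note that the linear decay shape is an assumption, as only the endpoints (\(\epsilon\)=1.0 down to \(\epsilon\)=0.05) and the direction of the decrease are fixed above:

```python
def epsilon(fe, max_fe, eps_start=1.0, eps_end=0.05):
    # anneal epsilon from eps_start (pure exploration) towards eps_end
    # (mostly exploitation) over the budget of function evaluations (FEs)
    frac = min(fe / max_fe, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

At each step the agent then picks a random applicable rule with probability epsilon and the greedy (highest-Q) rule otherwise.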
The search space is also determined by the sequence length (i.e., transformation steps) a solution comprises. It is also an essential parameter for the NSGA-II. We choose the maximum transformation length (\(TL_{max}\)) per case and complexity such that the optimal output model can be found if known, and otherwise set it to a length that enables far-reaching improvements.
5.2.4 Used Hardware & Software
Hardware All experiments were conducted on E2 virtual machines on the Google Cloud Platform, running an Intel(R) Xeon(R) CPU @ 2.20GHz with 2 cores on Debian GNU/Linux 12 and 32GB of physical memory.
Software Using Java 14.0.2 and Eclipse version 4.28.0, we limit memory usage to 20GB with initial heap size set to 1GB. Due to building on the toolset integrated in the MOMoT framework [3], the Eclipse Modeling Framework (v2.25) and Henshin SDK (v1.6.0) are required.
The extension for RL algorithms has been implemented to adhere to the interfaces of the MOEA framework, in particular the "AbstractAlgorithm" class, to enable the same usage pattern as with the GA encoding utilized for MDO. The algorithms and environment utilities were implemented as Java classes without a dedicated RL framework. To this end, we provide a unified interface for the following interaction pattern: execute a transformation rule (i.e., the action), receive the objective fitness value(s) of the model (i.e., the reward), and the transformed model (i.e., the successor state) as output. In doing so, we aim for comprehensibility and to facilitate further adaptation efforts, e.g., to implement new RL algorithms for MDO based on our framework.
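The described interaction pattern maps to a small environment interface. A hypothetical Python sketch follows; the actual extension consists of Java classes behind the MOEA framework's "AbstractAlgorithm" interface, and all names here are illustrative:

```python
class MdoEnvironment:
    # unified interaction pattern: execute a transformation rule (the
    # action), receive the model's objective fitness (the reward) and
    # the transformed model (the successor state)
    def __init__(self, initial_model, evaluate):
        self.initial_model = initial_model
        self.evaluate = evaluate  # model -> objective fitness value(s)
        self.model = initial_model

    def reset(self):
        # return to the initial model at the end of an episode
        self.model = self.initial_model
        return self.model

    def step(self, rule):
        # `rule` stands in for a matched transformation rule application
        self.model = rule(self.model)
        return self.model, self.evaluate(self.model)
```

An agent only ever sees this step/reset surface, which is what makes swapping in new RL algorithms straightforward.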
Our tool extension, including all case study setups and used artifacts such as models and transformation rules, is available online [10].
5.3 Results
First, we present the results of running single-objective RL algorithms for the aforementioned cases in Section 5.2.1. Subsequently, the results of executing the multi-objective RL variants using separate objectives instead of a combined metric for the CRP and CRA problem are presented. In all experiments, we also run NSGA-II (MOMoT encoding) as a baseline.
Where we refer to the “optimal model”, we mean the model with the highest obtainable fitness for the case under our problem specification in Section 5.2.1.
5.3.1 Single-criteria evaluation
In addition to the transformation length, the results involve a single domain-specific objective. Empirical results are reported in Tables 3 and 4.
Stack load balancing
HV In the five-stack case, the RL algorithms turn out superior, albeit \(Q\) only to a marginal extent, and show higher consistency (\(\sigma = .015\) and \(\sigma = .009\)). For higher-stacked models, \(Q^+\) increasingly outperforms NSGA-II as it covers a much broader region in the objective space, although NSGA-II claims some parts in the midsection for SLB-25 (Fig. 11a–c). This manifests for the larger models (SLB-25 and SLB-50) even when accounting for variations between runs. In this regard, \(Q^+\) also shows the most consistent performance. Meanwhile, \(Q\) falls off with increasing model size and, accounting for deviations, lies clearly below NSGA-II from SLB-25 onwards.
BOV The optimal distribution for the simplest case demands three steps, which all algorithms manage to find. For SLB-10, \(Q^+\) finds the solution for the most even load distribution with standard deviations over loads of .4. Moreover, facing increasing complexity with increasing stacks (SLB-25 and SLB-50), \(Q^+\) surpasses the GA in balancing the loads.
Runtime NSGA-II is the fastest algorithm for smaller model sizes and the correspondingly shorter maximum lengths for rule sequences. This advantage over the learning algorithms vanishes with more rules per individual: NSGA-II exhibits longer mean execution times than \(Q^+\) for SLB-25 and SLB-50 with \(TL_{max}\) of 50 and 100, respectively.
Effect size and significance Results show the superiority of \(Q^+\) for all instances with large or very large differences in the pairwise comparisons, whereas NSGA-II still vastly outperforms our basic Q-learning adaptation, named \(Q\). Further, all comparisons show a significant difference, except for NSGA-II and \(Q\) on the SLB-5 model (\(p = .053\)).
Table 4
Mean test statistic (\(\hat{A}_{12}\)) to measure the effect size of the HV indicator, averaged over all instances of the corresponding case study, based on 30 independent runs.
| Case | | NSGA-II | \(Q\) | \(Q^+\) |
|---|---|---|---|---|
| SLB | NSGA-II | – | 0.201 | 0.951 |
| | \(Q\) | 0.799 | – | 0.956 |
| | \(Q^+\) | 0.049 | 0.044 | – |
| CRP | NSGA-II | – | 0.941 | 0.807 |
| | \(Q\) | 0.059 | – | 0.136 |
| | \(Q^+\) | 0.193 | 0.864 | – |
| CRA | NSGA-II | – | 0.472 | 0.996 |
| | \(Q\) | 0.528 | – | 0.950 |
| | \(Q^+\) | 0.004 | 0.050 | – |
| Refactoring | NSGA-II | – | 0.634 | 0.793 |
| | \(Q\) | 0.366 | – | 0.667 |
| | \(Q^+\) | 0.207 | 0.333 | – |
Highlighted values indicate a medium (underline), large (italic), or very large (bold) difference in favor of the column subject
CRP
HV In nominal terms, \(Q\) determines widespread trade-offs between the required moves and the resulting container retrieval, and thus appears more effective than \(Q^+\) and NSGA-II. A look at the non-dominated solutions for each algorithm in Fig. 12a–c reveals dense regions from the learners for short transformation sequences, but also to a great extent for long sequences, particularly those from \(Q\). Consequently, the GA only marginally reaches the space outlined by the combined non-dominated set, which explains its low performance on the mid-sized model. Despite the high coverage of \(Q^+\) and \(Q\), the solutions determined for the later steps in the sequence are inferior to NSGA-II, which takes the lead on the BOV. The enhanced exploration employed by \(Q^+\) appears to be non-beneficial for this task. Although the Container-Index factors in the inaccessibility of containers yet to be retrieved, the greedy approach seems to neglect steps that are crucial for their later retrieval.
Table 5
Mean (\(\mu\)) HV and execution time, and standard deviation (\(\sigma\)) observed over 30 runs for multi-objective RL algorithms and NSGA-II (with MOMoT encoding)

| Case | Alg. | HV \(\mu\) | HV \(\sigma\) | Time \(\mu\) |
|---|---|---|---|---|
| CRA-9 | NSGA-II | .604 | .039 | 4.23s |
| | \(Q_{Chebyshev}\)+ | .649 | .008 | 5.84s |
| | \(Q_{W}\)+ | .653 | .010 | 5.64s |
| | \(PQ\)-\(HV\)+ | .653 | .007 | 5.93s |
| | \(PQ\)-\(PO\)+ | .654 | .007 | 6.20s |
| CRA-18 | NSGA-II | .492 | .056 | 19.86s |
| | \(Q_{Chebyshev}\)+ | .586 | .032 | 23.95s |
| | \(Q_{W}\)+ | .582 | .029 | 24.04s |
| | \(PQ\)-\(HV\)+ | .585 | .022 | 24.12s |
| | \(PQ\)-\(PO\)+ | .575 | .022 | 23.91s |
| CRA-35 | NSGA-II | .474 | .052 | 103.77s |
| | \(Q_{Chebyshev}\)+ | .426 | .029 | 120.80s |
| | \(Q_{W}\)+ | .405 | .024 | 121.30s |
| | \(PQ\)-\(HV\)+ | .407 | .030 | 119.47s |
| | \(PQ\)-\(PO\)+ | .409 | .022 | 118.93s |
| CRA-80 | NSGA-II | .400 | .045 | 885.00s |
| | \(Q_{Chebyshev}\)+ | .251 | .014 | 933.23s |
| | \(Q_{W}\)+ | .249 | .019 | 949.67s |
| | \(PQ\)-\(HV\)+ | .249 | .019 | 969.47s |
| | \(PQ\)-\(PO\)+ | .253 | .017 | 937.23s |
| CRP-3S8 | NSGA-II | .232 | .012 | 11.82s |
| | \(Q_{Chebyshev}\) | .226 | .012 | 30.44s |
| | \(Q_{W}\) | .231 | .011 | 30.63s |
| | \(PQ\)-\(HV\) | .223 | .010 | 28.43s |
| | \(PQ\)-\(PO\) | .230 | .011 | 24.13s |
| CRP-5S20 | NSGA-II | .051 | .028 | 95.67s |
| | \(Q_{Chebyshev}\) | .267 | .022 | 151.13s |
| | \(Q_{W}\) | .292 | .035 | 146.67s |
| | \(PQ\)-\(HV\) | .262 | .018 | 145.40s |
| | \(PQ\)-\(PO\) | .277 | .025 | 142.00s |
| CRP-10S40 | NSGA-II | .374 | .029 | 228.57s |
| | \(Q_{Chebyshev}\) | .453 | .034 | 389.63s |
| | \(Q_{W}\) | .451 | .026 | 379.43s |
| | \(PQ\)-\(HV\) | .423 | .032 | 387.00s |
| | \(PQ\)-\(PO\) | .430 | .026 | 382.90s |
Bold numbers mark the best mean for each case. The “+” after the algorithm name indicates that a local search was employed in the exploration phase (Alg. 2)
BOV Although it shows higher coverage in the objective space, the BOV (\(14.94 < 16.92\)) indicates that \(Q\) lacks further exploration of its options later in the episodes, with most progress occurring in the first 20-40 steps. The greedy approach turns out less suitable for this task than the classic variant.
Runtime Considering all instances together, NSGA-II positions itself between Q-learning with and without a local search in terms of runtime. \(Q^+\) shifts towards exploitation faster than \(Q\) and thus finishes much earlier. The GA spends time on the initialization of solutions and later on lengthy repair operations, which are expected to occur frequently due to infeasible relocation sequences resulting from recombination.
Effect size and significance \(Q\) yields a large performance difference over both \(Q^+\) and NSGA-II, which also holds for \(Q^+\) in contrast to NSGA-II. Only for the simplest instance, CRP-3S8, is the advantage of \(Q\) insignificant (\(Q\) vs. NSGA-II, \(p=.212\)).
CRA
HV Throughout all instances, \(Q^+\) provides the best trade-offs in assignments and model quality. With increasing features, the separation of dependencies works better by extensive exploration with \(Q^+\), which consistently outperforms \(Q\). Furthermore, the Pareto front approximation in Fig. 13a–c is for NSGA-II centered around transformations with medium length, but does not reach the extreme points. Additionally, \(\sigma \) shows the performance of \(Q^+\) as more consistent. \(Q\) yields better approximations for smaller models (CRA-9 and CRA-18), but loses against NSGA-II for the larger ones.
BOV Considering CRA-9, the optimal model is found universally. With increasing complexity, \(Q^+\) yields better solutions as shown by the BOV here embodying the CRA-Index. Performance of \(Q\) deteriorates with increasing features. In contrast, the NSGA-II finds better assignments although these are clearly not at the level of ones from \(Q^+\).
Runtime As observed before, the GA operates fast for short solution vectors, but its runtime grows with increasing model size and more extensive MT sequences. \(Q\) on average takes roughly twice as long as NSGA-II to instrument the reassignment steps for the CRA-9 instance. It is noteworthy that, in contrast to the genetic operators in the GA, each step of \(Q^+\) and \(Q\) is independent of the length of the transformation chains, so they become faster than the GA with increasing model size and thus transformation length.
Effect size and significance Result differences are large in favor of \(Q^+\) over the other algorithms, whereas no difference can be established between \(Q\) and NSGA-II. All tests show significance: \(Q\) outperforms NSGA-II on the smaller models (\(p < .000\) for CRA-9 and CRA-18, respectively), and vice versa, NSGA-II is preferable on the larger instances (\(p < .000\) for CRA-35 and CRA-80, respectively).
Refactoring
HV Both learning algorithms learn very well to instrument the refactoring operators in the right order. In fact, all solutions along the Pareto front are found by all algorithms (Fig. 14a and b), even when reducing the FEs per execution to 1,000. While \(Q^+\) takes the edge in average performance and turns out more robust concerning solution set quality (\(\sigma = .0\) and \(\sigma = .003\)), NSGA-II shows slightly less effectiveness and higher deviations between runs.
BOV Optimal solutions are found universally (BOV = 26 and BOV = 41 for all three algorithms). The lower deviations in the approximation sets may indicate that the learning algorithms conduct the right refactorings with the fewest steps.
Runtime As before, the GA is faster than the learning algorithms, with \(Q\) taking the longest. In the case of several different refactoring steps, the enhanced exploration in \(Q^+\) leads to substantially higher execution times than the random selection in \(Q\). This originates from the problem nature in connection with the three employed MT rules: often, only a small number of specific changes can be carried out on the current model in terms of matching rule applications. The local search, however, always probes a certain number of rule applications covering all rules, which often leads to unsuccessful and therefore lengthy matching attempts in the transformation engine.
Effect size and significance RL algorithms here show medium to very large differences to NSGA-II regarding performance with \(Q^+\) taking the edge overall. Significance is evident in all pairwise comparisons except for NSGA-II and \(Q\) on instance Ref-2 (\(p=.115\)).
5.3.2 Multi-criteria evaluation
In addition to the transformation length, the results involve more than one domain-specific objective. Empirical results are reported in Tables 5 and 6.
HV For the smaller models (CRA-9 and CRA-18), \(PQ\)-\(PO\) and \(Q_{Chebyshev}\) yield the best set approximations, respectively, yet differ only marginally when considering variations and the other learning-based algorithms. As seen earlier, the RL approximations are highly concentrated at shorter transformation sequences, as can also be observed in Fig. 15. The GA outperforms all multi-objective RL variants on the larger models but is subject to higher variance; the RL variants visibly deteriorate as the search space grows with increasing model size.
Runtime The runtime turns out similar to what we observed for single-objective RL and NSGA-II. Larger models, for which we permit longer MT sequences, lead to increased runtimes for the GA but not to the same extent for the RL algorithms. Thus, the difference in execution time of RL against NSGA-II diminishes the more feature assignments a solution may contain.
Effect size and significance No notable performance differences are observable with the aggregated statistic, however, some are significant when comparing instance by instance. NSGA-II is outperformed on small instances (\(p < .000\) for CRA-9 and CRA-18, respectively, against all RL algorithms) but excels for larger ones (\(p < .000\) for CRA-35 and CRA-80, respectively, against all RL algorithms). \(Q_{Chebyshev}\) appears superior to all other RL algorithms for CRA-35 (\(p < .008 \)) but yields no significant advantage for CRA-80 (\(p > .344\)).
CRP
HV For the smallest instance, all learning-based algorithms consistently find all solutions of the Pareto frontier; remarkably, only the GA fails to do so (\(\sigma = .012\)). For the mid-sized model, the RL algorithms perform equally well when considering deviations, whereas the trade-offs found by NSGA-II are inferior by a margin. This originates from the RL algorithms naturally and continually engaging with solutions of every sequence length; hence, we can observe these solutions close to the origin in Fig. 16, whereas the GA is not represented in this region. Given the variation observed for the largest model, no algorithm has a visible advantage in this task. Similarly, no differences in performance between the RL algorithms can be detected when the variations in all three cases are taken into account.
Table 6
Mean test statistic (\(\hat{A}_{12}\)) to measure the effect size of the HV indicator, averaged over all instances of the corresponding case study, based on 30 independent runs
| Case | | \(Q_{Chebyshev}\)+ | NSGA-II | \(PQ\)-\(HV\)+ | \(PQ\)-\(PO\)+ | \(Q_{W}\)+ |
|---|---|---|---|---|---|---|
| CRA | \(Q_{Chebyshev}\)+ | – | 0.487 | 0.464 | 0.500 | 0.462 |
| | NSGA-II | 0.513 | – | 0.495 | 0.494 | 0.488 |
| | \(PQ\)-\(HV\)+ | 0.536 | 0.505 | – | 0.530 | 0.500 |
| | \(PQ\)-\(PO\)+ | 0.500 | 0.506 | 0.470 | – | 0.477 |
| | \(Q_{W}\)+ | 0.538 | 0.512 | 0.500 | 0.523 | – |

| Case | | \(Q_{Chebyshev}\) | NSGA-II | \(PQ\)-\(HV\) | \(PQ\)-\(PO\) | \(Q_{W}\) |
|---|---|---|---|---|---|---|
| CRP | \(Q_{Chebyshev}\) | – | 0.208 | 0.366 | 0.497 | 0.620 |
| | NSGA-II | 0.792 | – | 0.728 | 0.810 | 0.842 |
| | \(PQ\)-\(HV\) | 0.634 | 0.272 | – | 0.658 | 0.756 |
| | \(PQ\)-\(PO\) | 0.503 | 0.190 | 0.342 | – | 0.639 |
| | \(Q_{W}\) | 0.380 | 0.158 | 0.244 | 0.361 | – |
Highlighted values indicate a medium (underline), large (italic), or very large (bold) difference in favor of the column subject
Runtime NSGA-II executes substantially faster, taking roughly between a third and a half of the time of the RL algorithms, depending on the model size. This advantage decreases for larger instances with longer MT sequences, whereas the RL approaches are very close to one another in this respect.
Fig. 15
Pareto front approximation for CRA-18 (3 objectives)
Effect size and significance All RL algorithms demonstrate a very large difference as indicated by the test statistic, favoring them over NSGA-II. Differences between the RL algorithms range from medium to very large, with no clear winner. Notably, NSGA-II still performs significantly better on instance CRP-3S8 against \(PQ\)-\(HV\) (\(p = .008\)), but is outperformed on instances CRP-5S20 and CRP-10S40 (\(p < .000\), against all RL algorithms).
5.4 Discussion
This section discusses the results in light of the research aim and RQs stated in Section 5.1.
5.4.1 Answering RQ1: Applicability
The results from various case studies, which we conducted with an available prototype that we extended with a palette of RL algorithms, advocate for the feasibility of RL for in-place MTs. The motivation to aim for high-scoring outcomes resulting from interaction within a limited time horizon aligns well with in-place MTs, where we look for models that express high problem-specific fitness with a preferably small number of changes. The implemented algorithms offer general applicability but, as became evident from our case studies, face limits when the search space is too large for sufficient exploration. A critical factor in our tabular methods is that all carried-out MTs are stored, which imposes a storage requirement. Naturally, this also holds for the multi-objective implementations. The rule applications are kept in memory at runtime to recognize previously carried-out MTs and to reconstruct transformed models at any given time.
An alternative strategy for retaining MTs encoded in a table-like structure involves acquiring and retaining information from past transformations by learning a parametric function. Consequently, storing explicit transformations becomes obsolete as the function is henceforth consulted for decision-making on the basis of the current (encoded) model. However, adjusting the parameters of this function requires a processable encoding for both the model and the rule applications.
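Such a parametric alternative can be sketched as a linear function approximator updated by temporal-difference learning. The feature encoding of the model and rule application is exactly the open prerequisite named above and is purely illustrative here:

```python
def q_value(weights, features):
    # linear value estimate replacing the tabular Q entry
    return sum(w * f for w, f in zip(weights, features))

def td_update(weights, features, reward, next_value, alpha=0.1, gamma=0.9):
    # one Q-learning step on the weights: past transformations are
    # summarized in `weights`, so storing every carried-out MT
    # explicitly becomes obsolete
    td_error = reward + gamma * next_value - q_value(weights, features)
    return [w + alpha * td_error * f for w, f in zip(weights, features)]
```

Decision-making then consults `q_value` on the encoded current model instead of looking up a stored transformation table.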
5.4.2 Answering RQ2: Performance
For single-objective RL, the resulting models express fitness values for a similar or better approximation of the Pareto front in most cases, provided that an extensive exploration phase is granted. Measured by HV and BOV, our \(Q^+\) variant mostly finds the highest-quality models. The plain Q-learning approach (\(Q\)) exhibits limitations as the search space grows with increasing model sizes and longer transformation sequences permitted by a greater \(TL_{max}\). A significant factor in \(Q\), which executes a randomly chosen applicable rule, is that each execution can influence the subsequent options in the current learning episode. Indeed, executing a suboptimal MT at the beginning of an episode may enter a region of the solution space that is detrimental to the further development of the chain. In the best case, this can be compensated or reversed by subsequently selected MTs. At worst, the negative effect propagates into subsequent decisions, impeding the transition to better regions until the reset to the initial state at the end of the misguided sequence. On the other hand, \(Q\) is suitable whenever the consideration of all options is crucial to recognize dependencies inherent to the domain. The CRP demonstrates this: important steps move one container to make another accessible, yet the primary reward is only issued upon the subsequent retrieval, so evaluating all possible moves in a greedy search is ineffective. It is important to note that the CRA problem is unaffected by the particular ordering; that is, whether a feature is assigned to a class sooner or later in the MT chain does not affect the output model. In such scenarios, where the search space is enormous and the resulting model is invariant to the order in the sequence, the greedy search demonstrates good performance as it fast-tracks the agent towards good regions early in an episode.
Meanwhile, the random movement with \(Q\) leads to comparatively slow improvements in such fitness landscapes, as becomes evident in particular in Fig. 11b and c, but also Fig. 13b and c. Therein, non-dominated solutions of \(Q\) deviate further from the approximation set of \(Q^+\) with increasing sequence length and solutions are far from achieving the same quality overall.
Adopting multi-objective formulations for the CRP and CRA gives ambivalent results. For the CRP, the set quality measured by HV draws a similar picture as when employing the combined metric, with the best results on average produced by the multi-objective learners. Nevertheless, the inspection of the non-dominated solutions produced by NSGA-II in Fig. 15 indicates convergence to optimal values, that is, the "Container" dimension in this case. For feature assignment in the CRA, likewise, the GA finds many of the extreme points, which appear off-limits for the RL algorithms here. This is not congruent with the single-objective RL results, where \(Q^+\) clearly surpasses NSGA-II, indicating substantially higher potential for the single-objective formulation of the greedy approach. Ultimately, some problems seem to be more amenable to rigorous exploration as realized in GAs or in \(Q^+\): namely, those with a large design space for the model and few restrictions regarding the applicability of the transformation rules, such as the assignment of features to classes in the CRA example. Other problems, however, require a more fine-grained procedure. This becomes evident with the CRP and its aforementioned interdependencies relevant to solving the task: each subsequent step must be carefully determined, and a greedy search is counterproductive if the optimal solution includes steps that appear suboptimal in the short term. In such cases, conventional Q-learning (\(Q\)) can be applied more effectively.
The incremental approach in all our implementations leads to longer execution times due to the continuous interaction with the transformation engine to determine the next step: in each iteration, the engine is tasked to suggest feasible rule applications for the current model state. GAs, in contrast, consult the engine mostly once to generate the initial population and thereafter only for computing mutations and repair operations; the engine is not involved in recombination. As a result, NSGA-II outpaces its RL competitors in almost all cases. Nevertheless, this gap in computation time decreases substantially as models increase in size for the more complex cases. The reason lies in the evaluation of solutions during initialization and mutation, which is less efficient than generating neighbors using the Henshin engine. Each newly generated transformation sequence must be executed in full to produce the solution model for its evaluation in the genetic algorithm, whereas the RL approaches only ever add one rule application at a time. \(Q^+\) exploits this further by generating several individual application steps at once.
The compared solutions emerged from learning episodes that used a fixed transformation sequence length (\(TL_{max}\)) as the termination criterion before resetting the environment to the initial model. This established equal terms when comparing the performance of the RL algorithms against NSGA-II, which is naturally bound to a fixed solution size. In practice, there is no such requirement for a constant termination criterion. In fact, a dedicated strategy could leave it to the agent to determine valuable chain lengths and to further explore models after transformations deemed particularly profitable. For instance, one could define a quality-based criterion that stops the episode if a certain number of recent transformations showed no improvement. Such a strategy may avoid the problematic scenario where one action hampers the progress for the whole remaining episode.
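Such a quality-based stopping rule could be sketched as follows. This is a hypothetical illustration under assumed names: `patience` (how many recent steps must be improvement-free) and a maximization setting are our assumptions, not a criterion implemented in the paper.

```python
def should_terminate(fitness_history, patience=5, min_delta=0.0):
    """Hypothetical quality-based stopping rule: end the episode early
    when the last `patience` transformations brought no improvement
    greater than `min_delta` over the best fitness seen before them.
    Assumes higher fitness is better."""
    if len(fitness_history) <= patience:
        return False
    best_before = max(fitness_history[:-patience])
    recent_best = max(fitness_history[-patience:])
    return recent_best <= best_before + min_delta
```

Called after every rule application with the fitness values produced so far, this would trigger the reset to the initial model as soon as the episode stagnates, instead of waiting for \(TL_{max}\) steps.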
5.5 Threats to validity
The conducted case studies come with caveats regarding their design and the generalizability of the results, which we discuss next under the four validity threats established by Wohlin et al. [60].
5.5.1 Conclusion validity
The RL algorithms were tested on several examples with models and transformation rules of different problem domains. All experiments involved 30 independent executions to estimate the consistency of results and minimize potential outlier effects. Furthermore, the small deviations observed among executions indicate a high level of reproducibility for the results. The execution runtimes were measured using the standard Java API (System.nanoTime()).
5.5.2 Construct validity
Construct validity questions the legitimacy of the chosen experiment setup with respect to the stated research objective. In this regard, we evaluated the optimization capability via the BOV obtained with each algorithm and used the well-established HV indicator to assess trade-offs within the produced solution sets. Admittedly, we only compared the final models produced in each case and therefore cannot pinpoint the cause of the superiority observed in some of the results from learned agents. Indeed, we demonstrated the effect of an enhanced exploration phase with the proposed \(Q^+\). Nevertheless, a thorough analysis of the execution phases in which the algorithms make most of their progress may justify a study of its own. NSGA-II represents a well-established GA that was demonstrated to perform best of all tested algorithms in most cases within the MOMoT approach [3], which is why we considered it an appropriate baseline for our experiments. Moreover, the episode length for the RL algorithms has been fixed at the maximum transformation length, thus ensuring comparability with the transformation solutions found by the GA. However, it should be noted that other criteria could prove more conducive to performance.
5.5.3 Internal validity
Internal validity can be threatened by our case study selection, the choices for variables in the experimental setup, and stochastic elements in the algorithms. We have demonstrated RL algorithms to tackle four problem formulations that are not amenable to exhaustive methods due to large or even infinite search spaces. The formulations contain either two or three objectives, where minimizing the transformation length is implicitly considered in the RL algorithms. To compare the solution sets, we used HV, which is considered the most important indicator for comprehensive quality evaluation in studies like ours [23]. Our configurations ensure a balance between exploration and exploitation via a common \(\epsilon \)-greedy strategy for the RL algorithms. Furthermore, we allowed all tested algorithms the same number of FEs for each execution, although other quality-oriented termination criteria could have been beneficial for the learning agent, e.g., by conducting premature resets in poorly evolving episodes. Reproducibility is threatened because the Henshin engine is set to propose applicable rules in a non-deterministic way; that is, the engine randomly returns one rule out of all rules applicable to a model. This is essential for generating a population of random rule sequences during initialization in the GA implementation, and further for enabling exploration in the RL algorithms. However, due to the low variance in our experiments with 30 repetitions, we do not consider it an issue.
5.5.4 External validity
The RL algorithms are integrated in the MOMoT framework, which uses Henshin as its MT language. Henshin supports non-deterministic constructs in transformations, although none are present in the case studies discussed in the present work; the implemented RL algorithms are, however, generally applicable in stochastic environments. The RL agents obtain and store a new state through exchange with the environment, which receives the applicable rules for the current model from the engine. Therefore, the environment and the RL algorithms would need to be adapted to accommodate other model representations and MT languages. Moreover, the solution representation of MOMoT builds on the APIs of the MOEA framework. Although we used several cases of varying complexity, further experiments are required to identify strengths and weaknesses before adding RL to MDE toolboxes.
6 Related work
In this section, we discuss related work concerning the application of search-based and learning-based techniques for MTs. First, we review existing work on search-based techniques for orchestrating transformation rules. Second, we elaborate on the usage of learning-based techniques in the context of MTs, including existing work on using RL for MTs. Finally, we highlight the differences between the existing related work and the present work.
6.1 Search-based approaches for model transformations
In the following, we elaborate on approaches that use search-based techniques for optimizing models by MTs. One pioneering work on this topic is that of Kessentini et al. [61] on using meta-heuristic search techniques to realize the idea of model transformation by example, i.e., deriving an optimal transformation for a model by considering already existing input/output model pairs. A generic encoding for performing meta-heuristic search on Ecore models is introduced in [62] and adapted by Efstathiou et al. in [63] to support multi-objective optimization and constraints ensuring semantically valid models. Several solutions to the CRA case study were presented at the Transformation Tool Contest in 2016 [64]. For instance, Lano et al. [65] leverage UML-RSDS, a toolset for model-based development with a GA integration. Born et al. [66] use Henshin rules and a custom optimization strategy based on GAs. Eickhoff et al. [67] present a graph-based approach to represent and evaluate possible assignment options: transformation rules generate new assignment clusters, and the A* search algorithm is used for efficient expansion with respect to the evaluation metric. In [68], the author uses a domain-specific language to convert the problem representation into a vector of variables that can be processed with evolutionary algorithms from the MOEA framework. Finally, the approach described in [66] uses a different formulation of the Henshin rules for the CRA case study and its own genetic operators for the search.
Several other solutions for MDO build on domain-agnostic frameworks such as MDEOptimiser [1], VIATRA Optimiser [2], and MOMoT [3] (which is also used in this work). These approaches consider a set of transformation rules and a set of objectives when searching for good rule application sequences. The general approach in VIATRA Optimiser and MDEOptimiser is to use a transformation engine for model evolution and combine it with search-based algorithm implementations. The former implements NSGA-II and runs transformation chains as members of the population; the latter incorporates the MOEA framework, which supports various evolutionary algorithms, and uses models as individual candidate solutions. In addition, MOMoT [3] provides a set of local and global optimizers, also building on the MOEA framework, which supports the statistical analysis of the performance of the different algorithms. In this respect, MOMoT is well suited for comparing such meta-heuristic techniques with learning-based approaches like those discussed in this work. In addition to MDO approaches operating directly on model representations during the search [4], search-based program transformations have been investigated as well to produce semantically equivalent code when moving between different programming languages [69, 70].
6.2 Learning-based approaches for model transformations
The adoption of machine learning in various scientific fields also covers concrete applications in the field of MDE targeting the automation of different kinds of MTs. In the following, we discuss approaches that investigate out-place transformations as well as more problem-specific transformations such as model repair, model recommendation, and model generation.
Concerning the application of machine learning to out-place transformations, Burgueño et al. [71] employ a neural net architecture dedicated to translation purposes for automatically realizing out-place MTs from example input/output model pairs. With this learning approach, there is no further need to develop any MT code. They present their approach for model-to-model transformations, in particular for translation transformations, i.e., migrating a model from one language to another.
The authors in [72] tackle the large training data requirements of supervised machine learning approaches by using RL techniques for deriving transformation rules. They use a Q-Learning approach for generating transformation rules between two different metamodels. They also rely on input/output pairs but require only a small data set, in contrast to the large set of input/output model pairs that a supervised learning approach may require.
Besides the mentioned approaches which provide general support of out-place transformations based on input/output pairs, there are several approaches that focus on specific transformation types or scenarios. Iovino et al. [73] also use RL but for the particular context of model repair, i.e., to find repair actions to fix errors in the model. More precisely, they use RL to select the best step from a set of valid repair steps considering different qualitative aspects of domain models and according to the modeler’s preferences. This context is also investigated by Barriga et al. [74] who perform a comparative study including four different RL algorithms which are applied for model repair. Furthermore, the authors of [75] explore the potential of AI-powered model repair. They discuss several utilization options of AI but also challenges in model repair tasks such as bug prediction and verification, the learning of automated repair strategies based on past treatment examples, or recommender systems that support modelers to meet domain-specific requirements.
Another line of research on learning-based approaches for specific transformation types is concerned with model completion based on recommendation systems. Rocco et al. [76] present a GNN-based recommender system to assist modelers during the modeling process. By observing the modeler’s current activities, the recommender uses similar models in the training set to make recommendations for the next modeling steps. For the same task, an assistant based on natural language processing is proposed by the authors of [77] to derive automatic completion suggestions for the model under construction. The approach relies on contextual information related to the specific modeling task on one hand and on general information related to the business domain on the other hand. Furthermore, the modeler’s acceptance of proposed changes is used as feedback to improve future recommendations.
Learning-based clustering of models has also been proposed. Addressing the problem of insufficient training data, López et al. [78] investigate the construction of datasets for MDE tasks. They propose a semi-automatic method along with a tool in which a clustering algorithm identifies similar models, thus facilitating classification. Likewise, metamodel classification is the objective in [79], where an automated labeling approach is presented for this purpose.
Finally, RL and model-driven reverse engineering have been combined for microservice migration [80], and learning-based approaches have been utilized to analyze MTs, e.g., to predict the performance of transformations [81].
6.3 Synopsis
In contrast to the discussed approaches, our work utilizes RL for the general field of rule-based in-place MTs. The aim is to find the optimal transformation of models under the premise that the adaptation possibilities are known (i.e., the rule set is already available for the optimization process) and the objectives for selecting the optimal model(s) are clear. The resulting problem-independent framework is flexible in its possible applications, e.g., it can handle single-objective as well as multi-objective formulations. Furthermore, it can be used in the development of models by fully delegating the design space exploration to the RL approaches, or in future online scenarios by continuously learning as a decision-maker in changing system environments, e.g., when realizing digital twins. In future work, we also plan to analyze whether problem-specific transformation approaches, e.g., out-place transformations and model repairs, may be formulated as in-place transformations and framed so that they are solvable by our presented framework.
7 Conclusion
In this paper, we have presented a general framework for applying RL to MDO problems. In particular, we collected and adopted a set of RL approaches offering different properties, which is also integrated with the MDO engine MOMoT. Based on this tool support, we have evaluated the applicability and the performance of RL approaches for MDO, also in comparison with the current state of the art based on meta-heuristic search approaches such as genetic algorithms. This also turns MOMoT into an experimental toolbox for comparing current advances in RL along the different dimensions the algorithms offer, and the incorporated local greedy search allows it to deal effectively with the problem of having many selection options in one step. The latter showed especially promising performance for solving single-objective problems with RL. For the investigated multi-objective problems, the results are more diverse, but for the container relocation problem, RL achieved promising results and improvements compared to genetic algorithms.
While the current results already seem promising, we foresee several lines of research to successfully adopt and further push RL for MDO in the future. One research direction we aim to explore is the application of deep reinforcement learning, which may require dedicated encodings of models and MTs. Another research direction concerns training specific optimization engines that are reusable for a large set of different input models without or with minimal retraining. Additionally, we also consider exchanging the greedy search with other search approaches such as simulated annealing. On another note, we aim to study the hybridization of different algorithms, such as training deep learners with meta-heuristic searchers. Finally, evaluating the potential of our rule-based RL encoding in contrast to native implementations of RL and model-based MDO approaches is an important avenue for future work, as is applying RL to vertical and exogenous transformations.
Acknowledgements
This work was funded by the Austrian Federal Ministry for Digital and Economic Affairs and the National Foundation for Research, Technology and Development.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Martin Eisenberg
is a PhD candidate at the Institute of Business Informatics - Software Engineering at the Johannes Kepler University Linz. His research focuses on intelligent software systems and applied machine learning, with current work exploring model-based systems like cyber-physical systems and digital twins to advance adaptive automation and decision-making. For more information, please visit https://se.jku.at/martin-eisenberg.
Manuel Wimmer
is full professor and head of the Institute of Business Informatics - Software Engineering at the Johannes Kepler University Linz. His research interests comprise foundations of software engineering techniques and their application in domains such as tool interoperability, software modernization, cyber-physical systems, and Quantum computing. For more information, please visit https://www.se.jku.at/manuel-wimmer.
A.1 The Chebyshev scalarization metric
In addition to the value approximations \(Q_i(s,a)\), the Chebyshev metric considers the optimal value observed thus far for each objective dimension. It has been used successfully in evolutionary computation [82] and was tested for multi-objective RL in [39].
Compared to linear scalarization, the Chebyshev metric showed superior performance in discovering optimal solutions and a higher spread, indicating more diverse solutions [39]. In Eq. A1, \(z^*\) denotes a reference point in m-dimensional objective space to which the weighted distance is calculated for every possible action. Thereby, \(z_i^*\) holds the best value observed for objective i so far, plus a small constant value.
The linear combination is not feasible or appropriate in every case. One considerable limitation is that it can only identify optimal solutions in convex regions, whereas the Chebyshev metric may also uncover solutions in concave regions of the Pareto front [43, 83]. Moreover, insufficiencies that arise from the weight setting and the scalarization step are discussed in [6].
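The scalarized action selection behind Eq. A1 can be sketched as follows. This is a minimal sketch under stated assumptions: the Q-table maps (state, action) pairs to objective-value vectors, and `z_star` is maintained externally as the best value observed per objective plus a small constant; the function names are illustrative.

```python
def chebyshev_scalarize(q_vector, weights, z_star):
    """Chebyshev scalarized Q-value for one action (cf. Eq. A1):
    the weighted maximum distance of the action's Q-vector from the
    reference point z*."""
    return max(w * abs(q - z) for q, w, z in zip(q_vector, weights, z_star))

def select_action_chebyshev(q_table, state, actions, weights, z_star):
    # The action closest to z* in weighted Chebyshev distance is preferred.
    return min(actions,
               key=lambda a: chebyshev_scalarize(q_table[(state, a)],
                                                 weights, z_star))
```

Note that, unlike linear scalarization, the metric is non-linear in the objective values, which is what allows it to reach points in concave regions of the Pareto front.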
A.2 The "negotiation" procedure in W-learning
The action-selection process in W-learning [40] resembles a tournament in which all agents compete by proposing the most rewarding action according to their learned value function \(Q_i\) for objective i. In each step, it seeks the action whose non-selection would lead to the largest decrease in long-term reward for the corresponding objective. The procedure is depicted in Algorithm 3. First, a (randomly selected) agent proposes a transformation for the model s that it deems optimal for its own objective. Each other agent i then determines as \(W_i\) the difference in expected reward between the optimal transformation for its own objective, \(Q_i(s,a_i)\), and the proposed one, \(Q_i(s,a_k)\). If the highest such weight exceeds that of the current proposer (\(W_k\)), the corresponding agent receives the right to propose, and all other agents refer to the new proposal in their evaluation. Otherwise, if the leading agent remains dominant, its selected transformation is executed.
Algorithm 3
Action selection in Negotiated W-learning (adopted from [40])
Negotiated W-learning represents a more versatile approach, with its step-wise consideration of trade-offs between all objective criteria while not relying on predetermined user preferences. However, different magnitudes in the objective values are troublesome, considering that the weights \(W_i\) evaluate absolute differences in the particular objective dimensions. Consequently, objectives whose values are larger in magnitude take precedence in the negotiation process, suggesting that the algorithm is best suited for normalized objectives.
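The negotiation loop described above can be sketched as follows. This is a hedged sketch of the tournament, not the paper's implementation: `q_tables[i]` is assumed to map (state, action) pairs to agent i's learned values, the first proposer is fixed for brevity where the text picks one at random, and all names are illustrative.

```python
def negotiate_action(q_tables, state, actions):
    """Negotiated W-learning action selection (after Algorithm 3):
    agents repeatedly challenge the current proposal with the loss
    they would incur if it were executed; the loop terminates because
    the leader's weight strictly increases with each takeover."""
    def best(i):
        # agent i's optimal action for its own objective
        return max(actions, key=lambda a: q_tables[i][(state, a)])

    leader = 0                 # here fixed; the paper selects it randomly
    proposal = best(leader)
    w_leader = 0.0             # the initial proposer concedes nothing yet
    while True:
        # W_i: how much agent i loses if the current proposal is executed
        losses = {
            i: q_tables[i][(state, best(i))] - q_tables[i][(state, proposal)]
            for i in range(len(q_tables)) if i != leader
        }
        challenger = max(losses, key=losses.get)
        if losses[challenger] > w_leader:
            leader, w_leader = challenger, losses[challenger]
            proposal = best(leader)
        else:
            return proposal    # leader remains dominant: execute its MT
```

As the prose notes, the comparison of absolute loss values is exactly where unnormalized objective magnitudes would bias the outcome.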
A.3 Pareto Q-Learning: Updates and action selection
As depicted in Algorithm 4, a non-dominated set of vectors \(ND_t(s,a)\) is maintained for each state-action pair [41]. After each transformation, the set is updated with the Pareto-optimal set of vectors of the successor model, \(Q_{set}(s',a')\). Updates are not made by adding the immediate and expected future rewards; rather, the immediate reward vector \(\varvec{r}(s,a)\) and the set of future reward vectors \(ND_t(s,a)\) are updated separately. The Pareto set \(Q_{set}(s,a)\) in Eq. A2 fuses \(\varvec{r}(s,a)\) and \(ND_t(s,a)\) by forming their vector sum (\(\oplus \)) and is used to decide on the best action during exploitation.
When selecting among previously encountered actions, they are compared using quality indicators determined by one of three available set evaluation mechanisms operating on the actions' objective vectors in \(Q_{set}\). The Hypervolume Set Evaluation mechanism employs the HV indicator (cf. Section 5.2.2): given the current state, after calculating the HV of each action's \(Q_{set}\), the best action is determined greedily as the one with the largest HV. As the reference point, the worst possible value can be taken for each objective considering all derivable policies. For Cardinality Set Evaluation, each executable action a is assigned a degree of domination equal to the number of Pareto-dominating Q-vectors in \(Q_{set}(s,a)\); the selected action is the one contributing the highest number of Q-vectors to the non-dominated set over all actions. With a similar but non-greedy strategy, Pareto Set Evaluation initially discards actions whose vectors in \(Q_{set}(s,a)\) are dominated with respect to the union of all actions' vectors; the selection is then made randomly from the actions connected to one of the remaining vectors.
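The set fusion of Eq. A2 and the Hypervolume Set Evaluation can be sketched as follows. Assumptions for this sketch: two maximization objectives (the general HV computation is more involved), a reference point dominated by all vectors, and illustrative function names.

```python
def q_set(reward, nd_set):
    """Q_set(s, a) = r(s, a) (+) ND_t(s, a): add the immediate reward
    vector to every non-dominated future-reward vector (cf. Eq. A2)."""
    return [tuple(r + v for r, v in zip(reward, vec)) for vec in nd_set]

def hv_2d(points, ref):
    """Hypervolume of a 2-objective maximization set w.r.t. a reference
    point that all points dominate (sweep in descending first objective)."""
    pts = sorted(set(points), reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                      # dominated points add nothing
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def hv_action_select(q_sets, ref):
    """Hypervolume Set Evaluation: greedily pick the action whose
    Q_set spans the largest hypervolume."""
    return max(q_sets, key=lambda a: hv_2d(q_sets[a], ref))
```

Cardinality and Pareto Set Evaluation would replace `hv_2d` by, respectively, counting an action's vectors in the global non-dominated set and filtering out actions with only dominated vectors.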
Burdusel, A., Zschaler, S., Strüber, D.: Mdeoptimiser: a search based model engineering tool. In: Babur, Ö., Strüber, D., Abrahão, S., Burgueño, L., Gogolla, M., Greenyer, J., Kokaly, S., Kolovos, D.S., Mayerhofer, T., Zahedi, M. (eds.) Proceedings of the 21st ACM/IEEE International Conference on Model Driven Engineering Languages and Systems: Companion Proceedings, MODELS 2018, Copenhagen, Denmark, October 14-19, 2018, pp. 12–16. ACM, (2018). https://doi.org/10.1145/3270112.3270130
2. Abdeen, H., Varró, D., Sahraoui, H.A., Nagy, A.S., Debreceni, C., Hegedüs, Á., Horváth, Á.: Multi-objective optimization in rule-based design space exploration. In: Crnkovic, I., Chechik, M., Grünbacher, P. (eds.) ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014, pp. 289–300. ACM, (2014). https://doi.org/10.1145/2642937.2643005
Hayes, C.F., Radulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L.M., Dazeley, R., Heintz, F., Howley, E., Irissappane, A.A., Mannion, P., Nowé, A., Oliveira Ramos, G., Restelli, M., Vamplew, P., Roijers, D.M.: A practical guide to multi-objective reinforcement learning and planning. Auton. Agents Multi Agent Syst. 36(1), 26 (2022). https://doi.org/10.1007/S10458-022-09552-Y
7. Barrett, T.D., Clements, W.R., Foerster, J.N., Lvovsky, A.I.: Exploratory combinatorial optimization with reinforcement learning. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3243–3250. AAAI Press, (2020). https://aaai.org/ojs/index.php/AAAI/article/view/5723
8. John, S., Burdusel, A., Bill, R., Strüber, D., Taentzer, G., Zschaler, S., Wimmer, M.: Searching for optimal models: comparing two encoding approaches. J. Object Technol. 18(3), 6–122 (2019). https://doi.org/10.5381/jot.2019.18.3.a6
9. Eisenberg, M., Pichler, H., Garmendia, A., Wimmer, M.: Towards reinforcement learning for in-place model transformations. In: 24th International Conference on Model Driven Engineering Languages and Systems, MODELS 2021, Fukuoka, Japan, October 10-15, 2021, pp. 82–88. IEEE, (2021). https://doi.org/10.1109/MODELS50736.2021.00017
Brambilla, M., Cabot, J., Wimmer, M.: Model-Driven Software Engineering in Practice, Second Edition. Synthesis Lectures on Software Engineering. Morgan & Claypool Publishers, (2017). https://doi.org/10.2200/S00751ED2V01Y201701SWE004
Kessentini, M., Langer, P., Wimmer, M.: Searching models, modeling search: On the synergies of SBSE and MDE. In: Paige, R.F., Harman, M., Williams, J.R. (eds.) 1st International Workshop on Combining Modelling and Search-Based Software Engineering, CMSBSE@ICSE 2013, San Francisco, CA, USA, May 20, 2013, pp. 51–54. IEEE Computer Society, (2013). https://doi.org/10.1109/CMSBSE.2013.6604438
Zschaler, S., Mandow, L.: Towards model-based optimisation: Using domain knowledge explicitly. In: Milazzo, P., Varró, D., Wimmer, M. (eds.) Software Technologies: Applications and Foundations - STAF 2016 Collocated Workshops: DataMod, GCM, HOFM, MELO, SEMS, VeryComp, Vienna, Austria, July 4-8, 2016, Revised Selected Papers. Lecture Notes in Computer Science, vol. 9946, pp. 317–329. Springer, (2016). https://doi.org/10.1007/978-3-319-50230-4_24
17. Strüber, D.: Generating efficient mutation operators for search-based model-driven engineering. In: Guerra, E., Brand, M. (eds.) Theory and Practice of Model Transformation - 10th International Conference, ICMT@STAF 2017, Marburg, Germany, July 17-18, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10374, pp. 121–137. Springer, (2017). https://doi.org/10.1007/978-3-319-61473-1_9
18. Burdusel, A., Zschaler, S., John, S.: Automatic generation of atomic consistency preserving search operators for search-based model engineering. In: Kessentini, M., Yue, T., Pretschner, A., Voss, S., Burgueño, L. (eds.) 22nd ACM/IEEE International Conference on Model Driven Engineering Languages and Systems, MODELS 2019, Munich, Germany, September 15-20, 2019, pp. 106–116. IEEE, (2019). https://doi.org/10.1109/MODELS.2019.00-10
19. Horcas, J.M., Strüber, D., Burdusel, A., Martinez, J., Zschaler, S.: We’re not gonna break it! consistency-preserving operators for efficient product line configuration. IEEE Trans. Software Eng. 49(3), 1102–1117 (2023). https://doi.org/10.1109/TSE.2022.3171404
Lúcio, L., Amrani, M., Dingel, J., Lambers, L., Salay, R., Selim, G.M.K., Syriani, E., Wimmer, M.: Model transformation intents and their properties. Softw. Syst. Model. 15(3), 647–684 (2016). https://doi.org/10.1007/s10270-014-0429-x
Li, M., Chen, T., Yao, X.: How to evaluate solutions in pareto-based search-based software engineering: a critical review and methodological guidance. IEEE Trans. Softw. Eng. 48(5), 1771–1799 (2022). https://doi.org/10.1109/TSE.2020.3036108
24. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., Fonseca, V.G.: Performance assessment of multiobjective optimizers: an analysis and review. IEEE Trans. Evol. Comput. 7(2), 117–132 (2003). https://doi.org/10.1109/TEVC.2003.810758
25. Steinberg, D., Budinsky, F., Merks, E., Paternostro, M.: EMF: Eclipse Modeling Framework. Addison Wesley, (2008)
26. Arendt, T., Biermann, E., Jurack, S., Krause, C., Taentzer, G.: Henshin: Advanced concepts and tools for in-place EMF model transformations. In: Petriu, D.C., Rouquette, N., Haugen, Ø. (eds.) Model Driven Engineering Languages and Systems - 13th International Conference, MODELS 2010, Oslo, Norway, October 3-8, 2010, Proceedings, Part I. Lecture Notes in Computer Science, vol. 6394, pp. 121–135. Springer, (2010). https://doi.org/10.1007/978-3-642-16145-2_9
Yang, R., Sun, X., Narasimhan, K.: A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 14610–14621 (2019). https://proceedings.neurips.cc/paper/2019/hash/4a46fbfca3f1465a27b210f4bdfe6ab3-Abstract.html
35. Li, K., Zhang, T., Wang, R.: Deep reinforcement learning for multi-objective optimization. CoRR abs/1906.02386 (2019). arXiv:1906.02386
36. Zhang, H., Yu, T.: Taxonomy of reinforcement learning algorithms. In: Dong, H., Ding, Z., Zhang, S. (eds.) Deep Reinforcement Learning: Fundamentals, Research and Applications, pp. 125–133. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-4095-0_3
37. AlMahamid, F., Grolinger, K.: Reinforcement learning algorithms: An overview and classification. CoRR abs/2209.14940 (2022). https://doi.org/10.48550/ARXIV.2209.14940. arXiv:2209.14940
Marler, R., Arora, J.: The weighted sum method for multi-objective optimization: New insights. Structural and Multidisciplinary Optimization 41, 853–862 (2010) https://doi.org/10.1007/s00158-009-0460-7
44. Karlsson, J.: Learning to solve multiple goals. PhD thesis, USA (1997). UMI Order No. GAX97-28447
Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., Vanhoucke, V.: Sim-to-real: Learning agile locomotion for quadruped robots. In: Kress-Gazit, H., Srinivasa, S.S., Howard, T., Atanasov, N. (eds.) Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018 (2018). https://doi.org/10.15607/RSS.2018.XIV.010. http://www.roboticsproceedings.org/rss14/p10.html
48.
Peng, X.B., Andrychowicz, M., Zaremba, W., Abbeel, P.: Sim-to-real transfer of robotic control with dynamics randomization. In: 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pp. 1–8. IEEE, (2018). https://doi.org/10.1109/ICRA.2018.8460528
49.
Andrychowicz, M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., Zaremba, W.: Learning dexterous in-hand manipulation. Int. J. Robot. Res. (2020). https://doi.org/10.1177/0278364919887447CrossRef
50. Runeson, P., Höst, M.: Guidelines for conducting and reporting case study research in software engineering. Empir. Softw. Eng. 14(2), 131–164 (2009). https://doi.org/10.1007/S10664-008-9102-8
52. Eisenberg, M., Lehner, D., Sindelár, R., Wimmer, M.: Towards reactive planning with digital twins and model-driven optimization. In: Margaria, T., Steffen, B. (eds.) Leveraging Applications of Formal Methods, Verification and Validation. Practice - 11th International Symposium, ISoLA 2022, Rhodes, Greece, October 22-30, 2022, Proceedings, Part IV. Lecture Notes in Computer Science, vol. 13704, pp. 54–70. Springer (2022). https://doi.org/10.1007/978-3-031-19762-8_5
53. Bowman, M., Briand, L.C., Labiche, Y.: Solving the class responsibility assignment problem in object-oriented analysis with multi-objective genetic algorithms. IEEE Trans. Softw. Eng. 36(6), 817–837 (2010). https://doi.org/10.1109/TSE.2010.70
54. Fleck, M., Troya, J., Wimmer, M.: The class responsibility assignment case. In: García-Domínguez, A., Krikava, F., Rose, L.M. (eds.) Proceedings of the 9th Transformation Tool Contest, Co-located with the 2016 Software Technologies: Applications and Foundations (STAF 2016), Vienna, Austria, July 8, 2016. CEUR Workshop Proceedings, vol. 1758, pp. 1–8. CEUR-WS.org (2016). https://ceur-ws.org/Vol-1758/paper1.pdf
55. Lano, K., Rahimi, S.K.: Case study: Class diagram restructuring. In: Gorp, P.V., Rose, L.M., Krause, C. (eds.) Proceedings Sixth Transformation Tool Contest, TTC 2013, Budapest, Hungary, 19-20 June, 2013. EPTCS, vol. 135, pp. 8–15 (2013). https://doi.org/10.4204/EPTCS.135.2
56. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evolut. Comput. 6(2), 182–197 (2002). https://doi.org/10.1109/4235.996017
57. Vargha, A., Delaney, H.D.: A critique and improvement of the “CL” common language effect size statistics of McGraw and Wong. J. Educ. Behav. Stat. 25(2), 101–132 (2000)
58. Li, J.C.: Effect size measures in a two-independent-samples case with nonnormal and nonhomogeneous data. Behav. Res. Methods 48(4), 1560–1574 (2016). https://doi.org/10.3758/s13428-015-0667-z
59. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18(1), 50–60 (1947)
61. Kessentini, M., Sahraoui, H.A., Boukadoum, M.: Model transformation as an optimization problem. In: Czarnecki, K., Ober, I., Bruel, J., Uhl, A., Völter, M. (eds.) Model Driven Engineering Languages and Systems, 11th International Conference, MoDELS 2008, Toulouse, France, September 28 - October 3, 2008. Proceedings. Lecture Notes in Computer Science, vol. 5301, pp. 159–173. Springer (2008). https://doi.org/10.1007/978-3-540-87875-9_12
62. Williams, J.R.: A novel representation for search-based model-driven engineering. PhD thesis, University of York, UK (2013). http://etheses.whiterose.ac.uk/5155/
63. Efstathiou, D., Williams, J.R., Zschaler, S.: Crepe complete: Multi-objective optimization for your models. In: Paige, R.F., Kessentini, M., Langer, P., Wimmer, M. (eds.) Proceedings of the First International Workshop on Combining Modelling with Search- and Example-Based Approaches Co-located with 17th International Conference on Model Driven Engineering Languages and Systems (MODELS 2014), Valencia, Spain, September 28, 2014. CEUR Workshop Proceedings, vol. 1340, pp. 25–34. CEUR-WS.org (2014). http://ceur-ws.org/Vol-1340/paper4.pdf
64. García-Domínguez, A., Krikava, F., Rose, L.M. (eds.): Proceedings of the 9th Transformation Tool Contest, Co-located with the 2016 Software Technologies: Applications and Foundations (STAF 2016), Vienna, Austria, July 8, 2016. CEUR Workshop Proceedings, vol. 1758. CEUR-WS.org (2016). http://ceur-ws.org/Vol-1758
65. Lano, K., Tehrani, S.Y., Rahimi, S.K.: Solving the class responsibility assignment case with UML-RSDS. In: García-Domínguez, A., Krikava, F., Rose, L.M. (eds.) Proceedings of the 9th Transformation Tool Contest, Co-located with the 2016 Software Technologies: Applications and Foundations (STAF 2016), Vienna, Austria, July 8, 2016. CEUR Workshop Proceedings, vol. 1758, pp. 9–14. CEUR-WS.org (2016). http://ceur-ws.org/Vol-1758/paper2.pdf
66. Born, K., Schulz, S., Strüber, D., John, S.: Solving the class responsibility assignment case with Henshin and a genetic algorithm. In: García-Domínguez, A., Krikava, F., Rose, L.M. (eds.) Proceedings of the 9th Transformation Tool Contest, Co-located with the 2016 Software Technologies: Applications and Foundations (STAF 2016), Vienna, Austria, July 8, 2016. CEUR Workshop Proceedings, vol. 1758, pp. 45–54. CEUR-WS.org (2016). http://ceur-ws.org/Vol-1758/paper8.pdf
67. Eickhoff, C., Raesch, L., Zündorf, A.: The SDMLib solution to the class responsibility assignment case for TTC2016. In: García-Domínguez, A., Krikava, F., Rose, L.M. (eds.) Proceedings of the 9th Transformation Tool Contest, Co-located with the 2016 Software Technologies: Applications and Foundations (STAF 2016), Vienna, Austria, July 8, 2016. CEUR Workshop Proceedings, vol. 1758, pp. 27–32. CEUR-WS.org (2016). http://ceur-ws.org/Vol-1758/paper5.pdf
68. Krikava, F.: Solving the TTC’16 class responsibility assignment case study with SIGMA and multi-objective genetic algorithms. In: García-Domínguez, A., Krikava, F., Rose, L.M. (eds.) Proceedings of the 9th Transformation Tool Contest, Co-located with the 2016 Software Technologies: Applications and Foundations (STAF 2016), Vienna, Austria, July 8, 2016. CEUR Workshop Proceedings, vol. 1758, pp. 55–60. CEUR-WS.org (2016). http://ceur-ws.org/Vol-1758/paper9.pdf
69. Fatiregun, D., Harman, M., Hierons, R.M.: Search based transformations. In: Cantú-Paz, E., Foster, J.A., Deb, K., Davis, L., Roy, R., O’Reilly, U., Beyer, H., Standish, R.K., Kendall, G., Wilson, S.W., Harman, M., Wegener, J., Dasgupta, D., Potter, M.A., Schultz, A.C., Dowsland, K.A., Jonoska, N., Miller, J.F. (eds.) Genetic and Evolutionary Computation - GECCO 2003, Genetic and Evolutionary Computation Conference, Chicago, IL, USA, July 12-16, 2003. Proceedings, Part II. Lecture Notes in Computer Science, vol. 2724, pp. 2511–2512. Springer (2003). https://doi.org/10.1007/3-540-45110-2_154
70. Fatiregun, D., Harman, M., Hierons, R.M.: Evolving transformation sequences using genetic algorithms. In: 4th IEEE International Workshop on Source Code Analysis and Manipulation (SCAM 2004), 15-16 September 2004, Chicago, IL, USA, pp. 66–75. IEEE Computer Society (2004). https://doi.org/10.1109/SCAM.2004.11
71. Burgueño, L., Cabot, J., Li, S., Gérard, S.: A generic LSTM neural network architecture to infer heterogeneous model transformations. Softw. Syst. Model. 21(1), 139–156 (2022). https://doi.org/10.1007/S10270-021-00893-Y
72. Brilhault, Q., Yahia, E., Roucoules, L.: Digital continuity based on reinforcement learning model transformations. In: Gerbino, S., Lanzotti, A., Martorelli, M., Mirálbes Buil, R., Rizzi, C., Roucoules, L. (eds.) Advances on Mechanics, Design Engineering and Manufacturing IV, pp. 442–453. Springer, Cham (2023)
74. Barriga, A., Mandow, L., Pérez-de-la-Cruz, J., Rutle, A., Heldal, R., Iovino, L.: A comparative study of reinforcement learning techniques to repair models. In: Guerra, E., Iovino, L. (eds.) MODELS ’20: ACM/IEEE 23rd International Conference on Model Driven Engineering Languages and Systems, Virtual Event, Canada, 18-23 October, 2020, Companion Proceedings, pp. 47–1479. ACM (2020). https://doi.org/10.1145/3417990.3421395
75. Barriga, A., Rutle, A., Heldal, R.: AI-powered model repair: an experience report - lessons learned, challenges, and opportunities. Softw. Syst. Model. 21(3), 1135–1157 (2022). https://doi.org/10.1007/s10270-022-00983-5
76. Rocco, J.D., Sipio, C.D., Ruscio, D.D., Nguyen, P.T.: A GNN-based recommender system to assist the specification of metamodels and models. In: 24th International Conference on Model Driven Engineering Languages and Systems, MODELS 2021, Fukuoka, Japan, October 10-15, 2021, pp. 70–81. IEEE (2021). https://doi.org/10.1109/MODELS50736.2021.00016
77. Burgueño, L., Clarisó, R., Gérard, S., Li, S., Cabot, J.: An NLP-based architecture for the autocompletion of partial domain models. In: Rosa, M.L., Sadiq, S.W., Teniente, E. (eds.) Advanced Information Systems Engineering - 33rd International Conference, CAiSE 2021, Melbourne, VIC, Australia, June 28 - July 2, 2021, Proceedings. Lecture Notes in Computer Science, vol. 12751, pp. 91–106. Springer (2021). https://doi.org/10.1007/978-3-030-79382-1_6
78. López, J.A.H., Izquierdo, J.L.C., Cuadrado, J.S.: Modelset: a dataset for machine learning in model-driven engineering. Softw. Syst. Model. 21(3), 967–986 (2022). https://doi.org/10.1007/s10270-021-00929-3
79. Nguyen, P.T., Rocco, J.D., Iovino, L., Ruscio, D.D., Pierantonio, A.: Evaluation of a machine learning classifier for metamodels. Softw. Syst. Model. 20(6), 1797–1821 (2021). https://doi.org/10.1007/s10270-021-00913-x
80. Dehghani, M., Rahimi, S.K., Tisi, M., Tamzalit, D.: Facilitating the migration to the microservice architecture via model-driven reverse engineering and reinforcement learning. Softw. Syst. Model. 21(3), 1115–1133 (2022). https://doi.org/10.1007/S10270-022-00977-3
81. Groner, R., Bellmann, P., Höppner, S., Thiam, P., Schwenker, F., Tichy, M.: Predicting the performance of ATL model transformations. In: Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering. ICPE ’23, pp. 77–89. Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3578244.3583727
82. Voß, T., Beume, N., Rudolph, G., Igel, C.: Scalarization versus indicator-based selection in multi-objective CMA evolution strategies. In: Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2008, June 1-6, 2008, Hong Kong, China, pp. 3036–3043. IEEE (2008). https://doi.org/10.1109/CEC.2008.4631208
83. Vamplew, P., Yearwood, J., Dazeley, R., Berry, A.: On the limitations of scalarisation for multi-objective reinforcement learning of pareto fronts. In: Wobcke, W., Zhang, M. (eds.) AI 2008: Advances in Artificial Intelligence, 21st Australasian Joint Conference on Artificial Intelligence, Auckland, New Zealand, December 1-5, 2008. Proceedings. Lecture Notes in Computer Science, vol. 5360, pp. 372–378. Springer (2008). https://doi.org/10.1007/978-3-540-89378-3_37