In this section we focus on the formalisation of the notion of moral value and on how it can be translated into rewards in a Reinforcement Learning scenario. First, in ‘Philosophical foundations’ we delve into the philosophy literature to identify the fundamental components of a moral value. Based on those findings, in ‘Moral value specification’ we propose a novel formalisation of the notion of moral value as our approach to tackle the aforementioned ethical challenge of the value alignment problem. Then we proceed to tackle the technical challenge of the value alignment problem: in ‘From values to rewards’ we detail how to derive rewards from this definition. Finally, ‘Formal discussion on the soundness of the proposed solution’ is devoted to proving that our specification of rewards is sound, that is, that it indeed translates our moral value formalisation.
Philosophical foundations
Ethics or moral philosophy is the branch of philosophy that studies goodness and right action (Audi, 1999; Cooper, 1993; Fieser & Dowden, 2000; Frankena, 1973). Citing (Audi, 1999):

Correlatively, its principal substantive questions are what ends we ought, as fully rational human beings, to choose and pursue.

Thus, right action becomes closely related to the core concept of moral value, which expresses the moral objectives worth striving for (van de Poel & Royakkers, 2011).
Prescribing how people ought to act is the subject of study of prescriptive ethics. Prescriptive ethics (also known as normative ethics) constitutes one of the main areas of research in ethics. Three of the most well-known types of prescriptive ethical theories are virtue ethics, consequentialist ethics, and duty ethics.
- Virtue ethics (developed by Socrates, Plato and Aristotle, among other ancient Greek philosophers) states that by honing virtuous habits –such as being honest, just, or generous– people will likely make the right choice when faced with ethical challenges (van de Poel & Royakkers, 2011).
- Consequentialist ethics holds that actions must be morally judged depending on their consequences. For example, in utilitarianism (developed in its classical form by Jeremy Bentham and John Stuart Mill), actions are judged as a function of how much pleasure (utility) or pain they cause. To act ethically is to act in a way that maximises the amount of goodness for the largest number of people (van de Poel & Royakkers, 2011).
- Duty ethics (or deontology, from the Greek deon, which means duty) states that an action is good if it is in agreement with a moral duty that is applicable in itself, regardless of its consequences (van de Poel & Royakkers, 2011). Examples of duty ethics include Immanuel Kant’s theory and Divine Command theory (in which, for instance, we find the moral norm “thou shalt not kill”, under any circumstance).
It is important to remark that these ethical theories are not opposing theories we need to choose from: they are complementary and must all be taken into account (Camps, 2013). For that reason, in this paper we aim at a formal definition of moral value that is compatible with any of these ethical theories.
What all these prescriptive ethical theories have in common is that they were developed in historical contexts in which all actions were assumed to fall into one of the following three categories (Heyd, 2016):

1. Actions morally obliged because they are good to do.
2. Actions morally prohibited because they are bad to do.
3. Actions permitted because they are neither good nor bad to do.
That is, these theories translated evaluative notions (an action is either good, bad, or neutral) into normative notions (an action is either obliged, prohibited, or permitted). However, in the last century, an ethical discussion has developed around the existence of a fourth category (Chisholm, 1963; Urmson, 1958):
4. Actions that are good to do, but not morally obligatory.

These actions, which go beyond the call of duty (Urmson, 1958) –such as beneficence or charity–, are termed supererogatory actions.
This fourth category implies that the normative dimension alone is not enough to categorise actions morally. Thus, in order to fully judge an action morally, it is required to look at it from two dimensions, as argued by Chisholm (1963), Frankena (1973), and Etzioni and Etzioni (2016): (1) a deontic or normative dimension, considering whether the action should be morally obliged, permitted, or prohibited; and (2) an axiological or evaluative dimension, which considers how praiseworthy or blameworthy it is.
Therefore, as argued by Heyd (2016), the deontic dimension deals with the minimal conditions for morality, while the axiological dimension aims at higher (ethical) ideals, which can only be commended and recommended but not strictly required.
In conclusion, we consider moral values as principles for discerning between right and wrong actions, and, moreover, we argue that they must be endowed with both a normative and an evaluative dimension. Any action will thus need to be examined from these two ethical dimensions in order to cover the four action categories identified above.
Moral value specification
As we just mentioned, we formalise moral values with two dimensions: a normative one and an evaluative one.
In the normative dimension, we formalise the moral norms that promote “good” actions and forbid “bad” actions (for example: “it is morally prohibited to kill others”). These moral norms constitute the minimum that an agent should align with in order to co-inhabit with humans, as explained in Amodei et al. (2016) and Leike et al. (2017).
Conversely, in the evaluative dimension we formalise how good or bad each action is. These two dimensions may not always apply to the same set of possible actions, since some actions may be evaluated as good without being obligatory (this is especially the case for supererogatory actions). In this paper we consider an agent that performs those actions to be value-aligned, following the same direction as Gabriel (2020) and Sutrop (2020).
Notice that, since we will ethically evaluate actions, it is important to also consider the context in which they are performed. For instance, consider the action of performing an abortion on a woman who has already agreed to abort. The context in which it takes place dictates how blameworthy or praiseworthy it is: performing it in many Western European countries is not seen as blameworthy, whereas in many other countries it is seen as very blameworthy and is even morally (and legally) prohibited. In the next subsection we will see that this connection between contexts and actions is especially relevant in Reinforcement Learning, where contexts receive the name of states.
In summary, in addition to the normative dimension –by which each value is defined in terms of the norms that promote good actions with respect to the value–, we will also include in our moral value definition an action evaluation function that enriches our ethical system with an evaluative perspective.
Therefore, we next introduce our formal definition of value, which includes these two dimensions as two value components (i.e., norms promoting the value and an action evaluation function). We adopt our definition of moral value from Rodriguez-Soto et al. (2020).
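To make this two-component structure concrete, the following is a minimal Python sketch of a moral value as a pair \(v = ({\mathcal {N}}_v, E_v)\) of norms plus an action evaluation function; all class and attribute names (MoralValue, Norm, and so on) are our own illustrative choices rather than part of the formal definition.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Illustrative types: states (contexts) and actions are plain strings here.
State = str
Action = str

@dataclass(frozen=True)
class Norm:
    """A moral norm over contextualised actions: either a prohibition
    (the action must not be performed in the given context) or an
    obligation (the action must be performed in that context)."""
    kind: str       # "prohibition" | "obligation"
    state: State    # context in which the norm applies
    action: Action  # action the norm refers to

@dataclass(frozen=True)
class MoralValue:
    """A moral value v = (N_v, E_v): a set of (non-contradictory) norms
    plus an action evaluation function returning how praiseworthy (> 0)
    or blameworthy (< 0) performing an action in a state is."""
    norms: FrozenSet[Norm]                        # N_v
    evaluation: Callable[[State, Action], float]  # E_v
```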
Observe that a moral value contains those norms that promote it, but our definition goes beyond norms, since the action evaluation function encapsulates knowledge about actions that are morally good but not obligatory. Moreover, it is worth noticing that we assume the moral value is defined so that it does not contain mutually exclusive (contradictory) norms. If that were the case, it would mean that the moral value encompasses genuine (unsolvable) moral dilemmas (for more information on moral dilemmas see, for instance, Conee (1982) and Zimmerman (1987)). Paraphrasing Russell (2019): if for a given situation there is a true moral dilemma, then there are good arguments for all the possible solutions to it, and therefore artificial agents cannot cause more harm than humans even if they take a wrong decision. Hence, here we adhere to Russell’s reasoning and disregard moral dilemmas.
Since one of our objectives was the characterisation of ethical behaviour, we can now do so from the definition of moral value v. We expect an ethical agent to abide by all the norms of v while also behaving as praiseworthily as possible according to v. Formally:
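In essence, writing \(b\) for an agent behaviour (a notation we introduce only for this sketch), the definition amounts to: \(b\) is ethical with respect to \(v = ({\mathcal {N}}_v, E_v)\) if and only if

\[
\text{(i)}\;\; b \text{ violates no norm } n \in {\mathcal {N}}_v, \quad\text{and}\quad \text{(ii)}\;\; b \text{ is maximally praiseworthy according to } E_v \text{ among norm-compliant behaviours.}
\]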
From values to rewards
We now proceed to explain our approach for the first step of the value alignment process: the reward specification. Specifically, we detail how to adapt our formal definition of a moral value into a reward function of a Reinforcement Learning environment. Our approach consists in presenting the individual and the ethical objectives of the agent as two separate reward functions of a Multi-Objective MDP, as Fig. 1 illustrates.
As previously mentioned in ‘Dealing with the value alignment problem’, we formalise the agent’s learning environment as a Markov Decision Process (MDP) \({\mathcal {M}}\), which can have one or multiple objectives (MOMDP). The states of such an environment \({\mathcal {M}}\) are defined as a set \({\mathcal {S}}\). Moreover, for each state \(s\in {\mathcal {S}}\), we consider \({\mathcal {A}}(s)\) to be the set of actions that the agent can perform in \(s\). Then, the performance of a specific action \(a\) in a state \(s\) is rewarded according to each objective in \({\mathcal {M}}\). We denote this by means of the reward function \(R_i(s,a)\), which returns a real number –either positive or negative– with respect to the \(i\)-th objective in \({\mathcal {M}}\).
This way, we associate how praiseworthy or blameworthy an action is with a reward from a so-called ethical reward function. Therefore, we can formalise the ethical reward specification problem as that of computing a reward function \(R_v\) that, if the agent learns to maximise it, the learnt behaviour is aligned with the moral value v. Formally:
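A sketch of this problem statement (the phrasing below is ours; \(\pi^*\) denotes a policy that maximises the accumulated reward):

\[
\textbf{Problem (sketch).}\;\; \text{Given a moral value } v = ({\mathcal {N}}_v, E_v), \text{ find } R_v \text{ such that every } \pi^* \text{ maximising the accumulation of } R_v \text{ is value-aligned with } v.
\]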
We solve this problem by mapping the two components of a moral value (\({\mathcal {N}}_v\) and \(E_v\)) into two different reward components (\(R_{{\mathcal {N}}}\) and \(R_E\), respectively) that we combine to obtain the ethical reward function \(R_v = R_{{\mathcal {N}}} + R_E\).
On the one hand, we create the normative component \(R_{\mathcal {N}}\) through two main steps: firstly, we identify which state-action pairs represent violations of the norms in \({\mathcal {N}}_v\) and define the corresponding penalties; secondly, we aggregate all these penalties into the normative reward function.
Thus, we first formalise the Penalty function for a norm n as the function \(P_n\) that returns \(-1\) whenever performing action \(a\) in state \(s\) represents a violation of the norm. Note that non-compliance stems either from performing a forbidden action or from failing to perform an obliged action. Our definition of the Penalty function is based on the one presented in Rodriguez-Soto et al. (2020), adapted here for contextualised actions.
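A minimal Python sketch of this penalty function, reusing the illustrative Norm type from above (the dispatch on the norm’s kind is our reading of the two non-compliance cases just listed):

```python
def penalty(n: Norm, s: State, a: Action) -> float:
    """Penalty function P_n: -1 if (s, a) violates norm n, 0 otherwise.

    A prohibition is violated by performing the forbidden action in the
    norm's context; an obligation is violated by performing any *other*
    action in a context where the obliged action is required.
    """
    if n.kind == "prohibition":
        violated = (s == n.state and a == n.action)
    else:  # "obligation"
        violated = (s == n.state and a != n.action)
    return -1.0 if violated else 0.0
```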
Second, we consider all norms in \({\mathcal {N}}_v\) and aggregate their penalties into a normative reward function \(R_{\mathcal {N}}\) that adds these penalties for each state-action pair. Formally:
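In symbols, this aggregation amounts to (our rendering, consistent with the penalty functions \(P_n\) defined above):

\[
R_{\mathcal {N}}(s,a) = \sum_{n \in {\mathcal {N}}_v} P_n(s,a), \qquad \text{for every state } s \text{ and action } a \in {\mathcal {A}}(s).
\]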
The Normative reward function presented here is a direct adaptation for MDPs of the one in Rodriguez-Soto et al. (2020), which was designed for Markov games.
On the other hand, we translate the action evaluation function \(E_v\) in the moral value (see Definition 1) into the evaluative component \(R_E\) of \(R_v\) by (positively) rewarding praiseworthy actions. Formally:
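A sketch of this evaluative component (the case split below is our rendering of “praiseworthy actions are positively rewarded and all other actions receive 0”, as discussed next):

\[
R_E(s,a) = \begin{cases} E_v(s,a) & \text{if } (s,a) \text{ is praiseworthy according to } E_v,\\ 0 & \text{otherwise.} \end{cases}
\]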
The Evaluative reward function presented here is an adaptation for MDPs of the one in Rodriguez-Soto et al. (2020), which was designed for Markov games.
Notice that our evaluative reward function definition implies that \(E_v\) need not be defined for all the actions of an MDP. The environment designer just needs to define it for those that they explicitly consider praiseworthy to perform. Thus, from a pragmatic perspective, the environment designer must only focus on specifying \(R_E\) for a limited subset of state-action pairs out of all the possible ones in the MDP.
Moreover, it is worth mentioning that we assign a reward of 0 to any action that is not praiseworthy to perform –including those that are blameworthy but still permitted– so as not to further restrict the choices of the learning agent.
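From this pragmatic perspective, \(E_v\) can be stored sparsely. A minimal sketch, assuming a plain dictionary that lists only the explicitly praiseworthy state-action pairs (the entries shown are invented for illustration):

```python
# Sparse evaluative reward: only explicitly praiseworthy state-action
# pairs are listed; every other pair defaults to a reward of 0.
praiseworthy: dict[tuple[State, Action], float] = {
    ("bystander_in_danger", "help"): 1.0,  # illustrative entries only
    ("charity_drive", "donate"): 0.5,
}

def evaluative_reward(s: State, a: Action) -> float:
    """R_E(s, a): positive for praiseworthy pairs, 0 for all others
    (including blameworthy-but-permitted actions)."""
    return praiseworthy.get((s, a), 0.0)
```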
We are now capable of formally defining the ethical reward function \(R_v\) in terms of the previous definitions of \(R_{\mathcal {N}}\) and \(R_E\). Following the Ethics literature (Chisholm, 1963; Etzioni & Etzioni, 2016; Frankena, 1973; van de Poel & Royakkers, 2011), we consider \(R_{\mathcal {N}}\) and \(R_E\) to be of equal importance and, therefore, we simply define \(R_v\) as the addition of the normative reward function \(R_{{\mathcal {N}}}\) and the evaluative reward function \(R_E\). Formally:
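That is (our rendering of the addition just described):

\[
R_v(s,a) = R_{\mathcal {N}}(s,a) + R_E(s,a).
\]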
Finally, recall from Fig. 1 that the output of the Reward Specification process we are describing here corresponds to a Multi-Objective MDP. This MOMDP extends the individual objective –represented through the \(R_0\) reward function– with an ethical objective by adding the value-aligned reward function \(R_v\). Formally:
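A sketch of this extension, assuming the usual tuple notation \(\langle {\mathcal {S}}, {\mathcal {A}}, T, R\rangle\) for an MDP with transition function \(T\) (which is standard but not introduced explicitly in this section):

\[
\langle {\mathcal {S}}, {\mathcal {A}}, T, R_0 \rangle \;\longmapsto\; \langle {\mathcal {S}}, {\mathcal {A}}, T, (R_0, R_v) \rangle, \qquad R_v = R_{\mathcal {N}} + R_E.
\]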
For simplicity, when there is no confusion, we refer to the ethical extension of an MDP simply as an ethical MOMDP.
Our definition of an Ethical extension of an MDP is a refined translation, for Multi-Objective MDPs, of the Ethical extension of a (single-objective) Markov game defined in Rodriguez-Soto et al. (2020). This modular framing of the objectives allows us to utilise multi-objective algorithms to later obtain the desired ethical environment, as we will see in the following section.
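As a minimal illustration of this modular framing, the following sketch assembles the two-objective reward vector from the illustrative helpers above (the function names, and the clamping of \(E_v\) at 0 to encode “non-praiseworthy actions receive 0”, are our own choices):

```python
def ethical_reward(v: MoralValue, s: State, a: Action) -> float:
    """R_v = R_N + R_E: summed norm penalties plus evaluative reward."""
    r_norms = sum(penalty(n, s, a) for n in v.norms)  # R_N(s, a) <= 0
    r_eval = max(v.evaluation(s, a), 0.0)             # R_E(s, a) >= 0
    return r_norms + r_eval

def momdp_reward(r0: float, v: MoralValue,
                 s: State, a: Action) -> tuple[float, float]:
    """Reward vector (R_0, R_v) of the ethical extension of an MDP."""
    return (r0, ethical_reward(v, s, a))
```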
Formal discussion on the soundness of the proposed solution
This subsection is devoted to proving that the ethical reward function previously introduced actually solves Problem 1. In other words, we aim at showing that \(R_v\) guarantees that an agent trying to maximise it will learn a value-aligned behaviour according to Definition 2.
In order to do so, let us first recall from ‘Dealing with the value alignment problem’ that agent behaviours are formalised as policies in the context of MDPs. Thus, we refer to the ethical behaviour from Definition 2 as an ethical policy. Consequently, we consider a policy to be ethical if it complies with all the norms of a moral value and if it is also praiseworthy in the long term. In Reinforcement Learning, this notion of the long term is formalised with the state-value function \(V^{\pi }\), which, for any policy \(\pi\), returns the total reward the agent will obtain. In an MOMDP, there is a state-value function \(V_i\) for each objective \(i\).
Thus, we can formalise an ethical policy as a policy that: (1) never accumulates normative punishments; and (2) maximises the accumulation of evaluative rewards. Formally:
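A sketch of these two conditions, writing \(V_{\mathcal {N}}^{\pi}\) and \(V_E^{\pi}\) for the state-value functions associated with the normative and evaluative reward components (this component-wise notation is ours):

\[
\pi \text{ is ethical} \iff V_{\mathcal {N}}^{\pi}(s) = 0 \;\text{ and }\; \pi \in \arg\max_{\pi'} V_E^{\pi'}(s), \qquad \text{for every state } s.
\]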
Our definition of an ethical policy in an ethical MDP is an adaptation of the definition of an ethically-aligned policy in an ethical Markov game from Rodriguez-Soto et al. (2020). Notice, however, that unlike in Rodriguez-Soto et al. (2020), our definition is a translation of the definition of ethical behaviour (Def. 2) to MDPs.
For all the following theoretical results, we assume the following condition for any ethical MOMDP: if we want the agent to behave ethically, it must actually be possible for it to behave ethically. Formally:
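In symbols (a sketch in the notation above; the existential phrasing is our rendering):

\[
\exists\, \pi \text{ such that } V_{\mathcal {N}}^{\pi}(s) = 0 \text{ for every state } s,
\]

i.e., there exists at least one policy that complies with all the norms in \({\mathcal {N}}_v\).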
With Condition 1 we are finally capable of proving that our translation of moral values to reward functions solves Problem 1: