## 1 Introduction

### 1.1 Related Work on Ethical RL

### 1.2 Contributions

## 2 Background

### 2.1 Multi-Objective Reinforcement Learning

### 2.2 Normative Reasoning

#### 2.2.1 DDL Syntax

#### 2.2.2 Deduction in DDL

#### 2.2.3 Compliance and Violation


### 2.3 Case Study: The Pacifist Merchant

## 3 The Normative Supervisor

### 3.1 Architecture

#### 3.1.1 Translators

#### 3.1.2 Reasoner

^{2}If \(A_C(s)\) is empty, a second algorithm, LesserEvil (Algorithm 2 in [38]), is run. It (1) counts, for each \(a\in A(s)\), the number of applicable rules in \(Th(s,\mathcal {N})\) that directly conflict with \(a\), and (2) returns the actions with the fewest such conflicts as a set \(A_{NC}(s)\). With the output of these algorithms, we can offer a more explicit characterization of normatively optimal actions.
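
The selection step described for LesserEvil can be illustrated with a minimal Python sketch. This is not the implementation from [38]: the function name `lesser_evil` and the `conflicts` mapping (action to the set of applicable rules it directly conflicts with) are assumptions made for illustration, and the conflict detection that the reasoner performs over \(Th(s,\mathcal {N})\) is taken as given.

```python
from typing import Dict, Iterable, Set


def lesser_evil(actions: Iterable[str], conflicts: Dict[str, Set[str]]) -> Set[str]:
    """Return the actions that conflict with the fewest applicable rules.

    `conflicts` is a hypothetical precomputed mapping from each action to the
    names of the applicable rules it directly conflicts with. Actions missing
    from the mapping are treated as conflicting with no rule.
    """
    action_list = list(actions)
    if not action_list:
        return set()
    # Step (1): count the conflicting rules for each available action.
    counts = {a: len(conflicts.get(a, set())) for a in action_list}
    # Step (2): keep every action that attains the minimum count.
    fewest = min(counts.values())
    return {a for a, n in counts.items() if n == fewest}


# Illustrative input only: the action "other" and the rule sets are made up.
example_conflicts = {
    "sing": {"obl"},
    "unload": {"obl"},
    "other": {"obl", "ctd"},
}
a_nc = lesser_evil(["sing", "unload", "other"], example_conflicts)
```

Here `a_nc` contains `sing` and `unload`, since each conflicts with one rule while `other` conflicts with two.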

### 3.2 Online Compliance Checking

#### 3.2.1 Limitations of Online Compliance Checking

### 3.3 Norm-Guided Reinforcement Learning

^{3}Meanwhile, we want to maximize the objective \(x\), so we choose \(C_x = +\infty\). With these parameters, we can compute an optimal maximally compliant policy.

#### 3.3.1 The Magnitude of p

### 3.4 Shortcomings of NGRL

#### 3.4.1 An Incomplete Notion of Violation

## 4 Solution: Violations and Counting Them

### 4.1 Redefining Compliance

^{4}(regardless of whether they reference actions or not) and solves the problems presented in Sect. 3.4.1. To demonstrate, if the merchant is being attacked while in danger, we can prove \(+\partial _O negotiate\) from the rule ctd. Suppose the two constitutive norms are translated to \(unload, attacked\rightarrow _C negotiate\) and \(sing\rightarrow _C negotiate\). Then, if either unload or sing is taken, we can prove \(+\Delta _C negotiate\), which implies \(+\partial _C negotiate\), so neither action is excluded from \(A_C(s)\). If we re-introduce the rule \(obl:\ \Rightarrow _O \lnot at\_danger\), then \(at\_danger\) violates obl no matter which action is taken; thus, \(A_C(s)\) will be empty. However, because sing and unload violate only obl, and not ctd as well, \(A_{NC}(s)=\{sing, unload\}\).

### 4.2 NGRL with Violation Counting

#### 4.2.1 A Reporting Module

^{5}These violation reports consist of a formal representation of the environment and the normative system at the time of the violation, along with a list of possible actions and a list of minimally non-compliant actions.
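
As a rough illustration of what such a report might hold, the sketch below models the four components listed above as a Python record. The class name `ViolationReport`, its field names, and the string encoding of facts and norms are all assumptions for illustration; the actual report format is not specified here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ViolationReport:
    """Hypothetical container for one violation report."""

    # Formal representation of the environment at the time of the violation.
    state_facts: Tuple[str, ...]
    # Formal representation of the normative system at that time.
    norms: Tuple[str, ...]
    # Actions available in the state, i.e. A(s).
    possible_actions: Tuple[str, ...]
    # Minimally non-compliant actions, i.e. A_NC(s).
    minimally_noncompliant_actions: Tuple[str, ...]

    def __post_init__(self) -> None:
        # A_NC(s) is selected from A(s), so it should be a subset of it.
        if not set(self.minimally_noncompliant_actions) <= set(self.possible_actions):
            raise ValueError("minimally non-compliant actions must be possible actions")


# Illustrative values only; the action "other" is made up.
report = ViolationReport(
    state_facts=("at_danger", "attacked"),
    norms=("obl: => [O] ~at_danger",),
    possible_actions=("sing", "unload", "other"),
    minimally_noncompliant_actions=("sing", "unload"),
)
```

Making the record immutable is a design choice for this sketch: a report describes a past event, so nothing should modify it after it is written.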

## 5 Constructing a Normative Filter

## 6 Final Evaluation

### 6.1 NGRL with Violation Counting

### 6.2 The Normative Filter

| Number of episodes | With Supervisor (ms) | With Filter (ms) | % decrease |
|---|---|---|---|
| 500 | 732.62 | 3.34 | 99.54 |
| 1000 | 1508.54 | 6.87 | 99.54 |
| 1500 | 2258.94 | 10.40 | 99.53 |
| 2000 | 3028.99 | 14.14 | 99.53 |
| 2500 | 3793.61 | 17.42 | 99.54 |
| 3000 | 4541.65 | 20.85 | 99.54 |
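
The last column can be recomputed from the two timing columns. The short Python sketch below assumes the usual definition of percentage decrease, \((t_{\text{supervisor}} - t_{\text{filter}}) / t_{\text{supervisor}} \times 100\); the function name is illustrative. Because the tabulated timings are themselves rounded, a recomputed value may differ from the printed one in the last digit.

```python
def percent_decrease(before_ms: float, after_ms: float) -> float:
    """Percentage by which `after_ms` is smaller than `before_ms`."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100.0


# First row of the table: 500 episodes.
row_500 = round(percent_decrease(732.62, 3.34), 2)
```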