research-article

Incorrigibility in the CIRL Framework

Author:
Ryan Carey

Oxford University, Oxford, United Kingdom


AIES '18: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, December 2018, Pages 30–35
https://doi.org/10.1145/3278721.3278750

Published: 27 December 2018

ABSTRACT

A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model misspecification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility. We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented, as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple approaches to attaining these sorts of guarantees in a value learning framework.
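
The central claim of the abstract can be illustrated with a toy calculation. The sketch below is not one of the paper's Supervised POMDP scenarios; it is a hypothetical two-outcome example in which every name (`expected_values`, `true_reward`, `buggy_reward`) and every number is an assumption chosen for illustration. The robot compares acting regardless of the human with deferring to a human who permits the action only when it is truly beneficial and otherwise issues a shutdown command worth 0.

```python
def expected_values(prior, robot_reward, human_reward):
    """Return (EV of acting regardless, EV of deferring to the shutdown command).

    Both values are computed under the robot's own reward model, because that
    model is what drives the robot's incentives.
    """
    # Acting regardless: the robot collects its modelled reward in every state.
    ev_act = sum(p * robot_reward(theta) for theta, p in prior.items())
    # Deferring: the human allows the action only where the true reward is
    # positive; in all other states the robot is shut down and receives 0.
    ev_defer = sum(
        p * robot_reward(theta)
        for theta, p in prior.items()
        if human_reward(theta) > 0
    )
    return ev_act, ev_defer


# Hypothetical prior over whether the robot's planned action is good or bad.
prior = {"helpful": 0.5, "harmful": 0.5}


def true_reward(theta):
    """Reward as the human actually values the action."""
    return 1.0 if theta == "helpful" else -1.0


def buggy_reward(theta):
    """Assumed programmer error: the reward model ignores theta entirely."""
    return 1.0


correct = expected_values(prior, true_reward, true_reward)
buggy = expected_values(prior, buggy_reward, true_reward)

print("correct model (act, defer):", correct)
print("buggy model   (act, defer):", buggy)
```

With a correctly specified reward, deferring has the higher expected value, so the shutdown command is informative and the robot has an incentive to obey it. With the assumed bug, acting regardless scores higher under the robot's own model, so that incentive disappears even though the human's behaviour is unchanged.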

References

Stuart Armstrong. 2010. Utility Indifference. Technical Report 2010-1. Oxford: Future of Humanity Institute, University of Oxford.
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2017. The Off-Switch Game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 220–227.
Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. 2017. Should Robots be Obedient? arXiv preprint arXiv:1705.09990 (2017).
Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. 2015. Corrigibility. In 1st International Workshop on AI and Ethics at AAAI-2015.

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI theory, concepts and models
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Reinforcement learning
        Inverse reinforcement learning
        Multi-agent reinforcement learning

Published in
AIES '18: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society
December 2018
406 pages
ISBN:9781450360128
DOI:10.1145/3278721
Program Chairs:
Jason Furman, Harvard University, USA
Gary Marchant, Arizona State University, USA
Huw Price, Cambridge University, UK
Francesca Rossi, IBM Research, USA & University of Padova, Italy
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ai safety
cirl
cooperative inverse reinforcement learning
corrigibility
Acceptance Rates
AIES '18 Paper Acceptance Rate: 61 of 162 submissions, 38%
Overall Acceptance Rate: 61 of 162 submissions, 38%