Abstract

When we discuss future advanced autonomous AI systems, one of the worries is that these systems will be capable enough to resist external intervention, even when such intervention is crucial, for example, when the system is not behaving as intended. The rationale behind such worries is that sufficiently intelligent systems will be motivated to resist attempts to modify or shut them down in order to preserve their objectives. To address these worries, we want our future systems to be corrigible, i.e., to tolerate, cooperate with, or assist many forms of outside correction. One key reason for treating corrigibility as an important safety property is that we already know how hard it is to construct AI agents with a sufficiently general utility function; and the more advanced and capable the agent is, the less likely it is that a complex baseline utility function built into it will be perfect from the start. In this paper, we try to achieve corrigibility in (at least) systems based on known or near-future (imaginable) technology, by endorsing and integrating different approaches to building AI-based systems. Our proposal replaces attempts to provide a corrigible utility function with a corrigible software architecture; this removes agency from the RL agent, which now becomes an RL solver, and grants it to the system as a whole.
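To make the architectural idea in the last sentence concrete, the following is a minimal sketch, not the paper's actual design: the abstract gives no API, so all names here (RLSolver, SupervisoryController, Command) are hypothetical. The point it illustrates is that the learned component only proposes actions, while the surrounding system retains the authority to accept, pause, or shut down at every step.

```python
# Hypothetical illustration of a "corrigible software architecture":
# agency sits in the controller loop, not in the RL component.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional
import random


class Command(Enum):
    """External operator commands the system must always honor."""
    NONE = auto()
    PAUSE = auto()
    SHUTDOWN = auto()


@dataclass
class RLSolver:
    """A policy treated purely as a solver: it maps states to proposed
    actions and has no say over whether its proposals are executed."""
    actions: list

    def propose(self, state) -> str:
        # Stand-in for a trained policy; here it just picks randomly.
        return random.choice(self.actions)


class SupervisoryController:
    """System-level component holding the agency: it consults the solver,
    but checks for outside correction before every step."""

    def __init__(self, solver: RLSolver, get_command: Callable[[], Command]):
        self.solver = solver
        self.get_command = get_command  # channel for operator input
        self.running = True

    def step(self, state) -> Optional[str]:
        command = self.get_command()
        if command is Command.SHUTDOWN:
            self.running = False           # no resistance to shutdown
            return None
        if command is Command.PAUSE:
            return None                    # tolerate interruption
        return self.solver.propose(state)  # only now consult the solver


if __name__ == "__main__":
    solver = RLSolver(actions=["left", "right", "wait"])
    # Simulated operator: issues SHUTDOWN on the third step.
    commands = iter([Command.NONE, Command.NONE, Command.SHUTDOWN])
    controller = SupervisoryController(solver, get_command=lambda: next(commands))

    t = 0
    while controller.running:
        print(f"step {t}: action={controller.step(state=t)}")
        t += 1
```

The design choice illustrated is that corrigibility is enforced structurally, by where the control flow lives, rather than by shaping the solver's utility function to value being corrected.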
