Abstract
Reward tampering is a problem that will impact the trustworthiness of the powerful AI systems of the future. Reward Tampering describes the problem where AI agents bypass their intended objective, enabling unintended and potentially harmful behaviours. This paper investigates whether the creative potential of evolutionary algorithms could help ensure trustworthy solutions when facing this problem. The reason why evolutionary algorithms may help combat reward tampering is that they are able to find a diverse collection of different solutions to a problem within a single run, aiding the search for desirable solutions. Four different evolutionary algorithms were deployed in tasks illustrating the problem of reward tampering. The algorithms were designed with varying degrees of human expertise, measuring how human guidance influences the ability to discover trustworthy solutions. The results indicate that the algorithms’ ability to find and preserve trustworthy solutions is very dependent on preserving diversity during the search. Algorithms searching for behavioural diversity showed to be the most effective against reward tampering. Human expertise also showed to improve the certainty and quality of safe solutions, but even with only a minimal degree of human expertise, domain-independent diversity management was found to discover safe solutions.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.