Reward tampering is a problem that will impact the trustworthiness of the powerful AI systems of the future. Reward Tampering describes the problem where AI agents bypass their intended objective, enabling unintended and potentially harmful behaviours. This paper investigates whether the creative potential of evolutionary algorithms could help ensure trustworthy solutions when facing this problem. The reason why evolutionary algorithms may help combat reward tampering is that they are able to find a diverse collection of different solutions to a problem within a single run, aiding the search for desirable solutions. Four different evolutionary algorithms were deployed in tasks illustrating the problem of reward tampering. The algorithms were designed with varying degrees of human expertise, measuring how human guidance influences the ability to discover trustworthy solutions. The results indicate that the algorithms’ ability to find and preserve trustworthy solutions is very dependent on preserving diversity during the search. Algorithms searching for behavioural diversity showed to be the most effective against reward tampering. Human expertise also showed to improve the certainty and quality of safe solutions, but even with only a minimal degree of human expertise, domain-independent diversity management was found to discover safe solutions.
Read full abstract