Creating realistic animations of human faces remains a challenging task in computer graphics. While computer graphics (CG) models capture much variability in a small parameter vector, they usually do not meet the necessary visual quality. This is because geometry-based animation often does not allow fine-grained deformations and fails to produce realistic renderings in difficult areas such as the mouth and eyes. Image-based animation techniques avoid these problems by using dynamic textures that capture details and small movements not explained by geometry. This comes at the cost of high memory requirements and limited animation flexibility, because dynamic texture sequences must be concatenated seamlessly, which is not always possible and is prone to visual artefacts. In this study, the authors present a new hybrid animation framework that exploits recent advances in deep learning to provide an interactive animation engine, operated through a simple and intuitive visualisation for facial expression editing. The authors describe an automatic pipeline that generates training sequences consisting of dynamic textures together with sequences of consistent three-dimensional face models. Based on these data, they train a variational autoencoder to learn a low-dimensional latent space of facial expressions, which is then used for interactive facial animation.
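To make the core idea concrete, the following is a minimal sketch of the kind of variational autoencoder the abstract describes, written in PyTorch. The class name, layer sizes, input dimensionality, and latent dimensionality are illustrative assumptions, not the architecture used by the authors; the input here is simply a flattened face representation standing in for the paper's combined texture-plus-geometry training data.

```python
# Hypothetical sketch of a VAE for a low-dimensional facial-expression
# latent space. All dimensions and names are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpressionVAE(nn.Module):
    """Maps a flattened face representation to a small latent space and back."""

    def __init__(self, input_dim: int = 1024, latent_dim: int = 8):
        super().__init__()
        # Encoder: compress the input towards the parameters of q(z|x).
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)
        # Decoder: reconstruct the input from a latent code z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def reparameterize(self, mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
        # z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through sampling.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar


def vae_loss(recon: torch.Tensor, x: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Standard VAE objective: reconstruction error plus KL divergence
    # between q(z|x) and the standard normal prior.
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```

Under this sketch, interactive expression editing amounts to moving a point `z` in the eight-dimensional latent space and decoding it with `model.decoder(z)`, which is what makes a low-dimensional, well-structured latent space attractive as an animation interface.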