Abstract In this paper, we propose FSGAN, a scene animation network that uses spectral features to perceive the deep style of an image. It uses three loss functions to capture the distribution of image style features within the same style domain and thereby generates images with a consistent anime style. For portrait scenes, the cycle-consistent adversarial network is supplemented with a self-attention-based local reinforcement module adapted to the characteristics and requirements of portrait animation. This module guides the network to attend to the salient regions of the image and combines layer normalization and instance normalization, weighted by the attention weight coefficients, to construct a portrait animation network based on local reinforcement perception. Then, the virtual reality scene is constructed with digital media technology, the constructed space is divided in three dimensions, and a generative adversarial network is used to train the model. A controllable virtual animation scene generation method that combines key events with random sampling is proposed to achieve both controllability and diversity in virtual animation scene generation. The first-stage design efficiency, second-stage design efficiency, and overall design efficiency of the proposed animation creation method all remain largely stable between 0.8 and 1, which is 0.3932, 0.3596, and 0.5635 higher, respectively, than the traditional method. The extreme pixel percentage, standard deviation, and contrast of the animation images created by the proposed method are 0.0349, 0.0037, and 0.3382, respectively, indicating better overall color performance. Thus, the proposed method gives the best overall results when processing anime images with digital media art effects.
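The weighted combination of layer normalization and instance normalization mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes a simple convex blend in which a per-channel weight `rho` in [0, 1] stands in for the attention-derived coefficient. The names `blended_norm`, `_normalize`, and `rho` are hypothetical and are not taken from the paper, and learnable scale and shift parameters are omitted.

```python
from math import sqrt


def _normalize(values, eps=1e-5):
    """Shift a flat list of floats to zero mean and scale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / sqrt(var + eps) for v in values]


def blended_norm(feature_map, rho, eps=1e-5):
    """Blend instance and layer normalization (illustrative assumption).

    feature_map: list of C channels, each a flat list of H*W activations.
    rho: list of C weights in [0, 1]; 1 means pure instance normalization,
         0 means pure layer normalization for that channel.
    """
    if len(rho) != len(feature_map):
        raise ValueError("rho needs one weight per channel")

    # Instance normalization: statistics computed separately per channel.
    inst = [_normalize(channel, eps) for channel in feature_map]

    # Layer normalization: statistics computed over all channels together.
    flat = _normalize([v for channel in feature_map for v in channel], eps)
    layer, start = [], 0
    for channel in feature_map:
        layer.append(flat[start:start + len(channel)])
        start += len(channel)

    # Convex combination of the two normalized maps, channel by channel.
    return [
        [r * a + (1.0 - r) * b for a, b in zip(inst_ch, layer_ch)]
        for r, inst_ch, layer_ch in zip(rho, inst, layer)
    ]
```

With `rho` close to 1 for a channel, that channel keeps its own statistics, which tends to preserve local contrast. With `rho` close to 0, all channels share statistics, which tends to favor a globally consistent style. In a trained network `rho` would be learned or predicted from attention features rather than fixed by hand.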