Image generation technology has recently produced impressive results. However, precisely recognizing the emotion in a sound and accurately expressing it on the face of a designated person remains a major challenge. To address this challenge, a new framework, Sound to Expression (S2E), is proposed, which uses the emotion in sound to guide facial expression image generation. A speech dataset for emotion recognition is also constructed. S2E can edit the facial expressions of different people according to the different emotions carried in sounds. S2E consists of four components: the Continuous Wavelet Transform (CWT), YOLOv3, ChatGPT-3, and a facial expression diffusion editing model (FEDEM). CWT is used to extract emotional features from the sounds, and YOLOv3 is employed to identify the emotion category. The emotion category and the name of a specific person are then input into ChatGPT-3, which randomly generates a description of that person and emotion. This description is input into FEDEM to generate a facial expression image. To generate more accurate images and address emotional semantic deviation, a new facial detail emotional preservation loss is proposed. Experimental results show that S2E can accurately recognize the emotion in a voice and use it to guide the editing of the specified person's facial expression, producing more accurate images.
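As an illustration of the first stage only, the following is a minimal sketch of how a CWT turns a one-dimensional sound signal into a time-scale map (a scalogram) that an image-based classifier could consume. It is not the authors' implementation: the abstract does not state which mother wavelet or parameters are used, so the Morlet wavelet, the value `w0 = 6.0`, and the function names `morlet` and `cwt_scalogram` are all assumptions made for this example.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet wavelet (assumed choice; small correction term omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_scalogram(signal, scales, dt=1.0, w0=6.0):
    """Return |W(s, b)| for every scale s and every sample position b.

    Direct discretization of W(s, b) = s^(-1/2) * integral x(t) psi*((t - b) / s) dt.
    Cost is O(len(scales) * N^2), so this is only suitable for short signals.
    """
    n = len(signal)
    rows = []
    for s in scales:
        norm = dt / math.sqrt(s)
        row = []
        for b in range(n):
            acc = 0j
            for k in range(n):
                acc += signal[k] * morlet((k - b) * dt / s, w0).conjugate()
            row.append(abs(acc) * norm)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical test tone: 0.1 cycles per sample, 128 samples.
    freq = 0.1
    tone = [math.sin(2 * math.pi * freq * k) for k in range(128)]
    # For a Morlet wavelet, scale s responds most strongly near f = w0 / (2 * pi * s).
    scales = [4.0, 6.0 / (2 * math.pi * freq), 20.0]
    scalogram = cwt_scalogram(tone, scales)
    centre = [row[64] for row in scalogram]
    print("strongest scale index at the centre sample:", centre.index(max(centre)))
```

Each row of the returned list corresponds to one scale and each column to one time position, so the result can be rendered as an image. A practical system would more likely use an FFT-based routine from a wavelet library than this quadratic-time loop.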