Abstract

Frames of a talking video occasionally drop during streaming due to issues such as network errors, which severely degrades online team collaboration and user experience. Directly generating the dropped frames from the remaining ones is unfavorable, since a person's lip motion is usually non-linear and thus hard to restore when consecutive frames are missing. Nevertheless, the audio content provides strong signals for lip motion and is less likely to be dropped during transmission. Inspired by this, as an initial attempt, we present the task of audio-driven talking video frame restoration in this paper, i.e., restoring dropped video frames by jointly leveraging the audio and the remaining video frames. Towards high-quality frame generation, we devise a cross-modal frame restoration network that aligns the complete audio content with the video frames, precisely identifies the dropped frames, and generates them sequentially. To evaluate our model, we construct a new dataset, Talking Video Frames Drop (TVFD for short), consisting of 2.5K videos and 144K frames in total. We conduct extensive experiments on TVFD and another publicly accessible dataset, VoxCeleb2. Our model achieves significantly improved performance compared to other state-of-the-art competitors.
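To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the general idea: audio features attend over the surviving frame features for cross-modal alignment, a mask identifies the dropped positions, and a recurrent decoder fills those positions in temporal order. This is an illustrative assumption about one plausible realization, not the authors' released implementation; all module names, dimensions, and the choice of cross-attention plus GRU are hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalRestorer(nn.Module):
    """Hypothetical sketch: restore dropped frame features from audio."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Align the audio content with the remaining video frames.
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Sequentially generate features along the timeline.
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, audio_feats, frame_feats, drop_mask):
        # audio_feats: (B, T, dim) per-frame audio features
        # frame_feats: (B, T, dim) frame features; dropped slots zeroed out
        # drop_mask:   (B, T) bool, True where a frame was dropped
        # Audio queries attend over surviving frames (cross-modal alignment);
        # dropped slots are excluded as attention keys.
        aligned, _ = self.audio_to_video(
            audio_feats, frame_feats, frame_feats,
            key_padding_mask=drop_mask)
        # Decode the full timeline conditioned on the aligned audio.
        hidden, _ = self.decoder(aligned)
        restored = self.out(hidden)  # (B, T, dim)
        # Keep surviving frames; fill only the dropped positions.
        return torch.where(drop_mask.unsqueeze(-1), restored, frame_feats)

# Toy usage: 8 frames, with frames 3-4 dropped.
B, T, D = 1, 8, 256
audio = torch.randn(B, T, D)
frames = torch.randn(B, T, D)
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, 3:5] = True
frames = frames.masked_fill(mask.unsqueeze(-1), 0.0)
out = CrossModalRestorer()(audio, frames, mask)
print(out.shape)  # torch.Size([1, 8, 256])
```

In such a design, the mask plays a dual role: it prevents the alignment step from attending to missing slots and tells the output stage which positions to overwrite, so surviving frames pass through unchanged.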
