Abstract
In the lip-reading field, speaker-independent recognition tasks are typically addressed by collecting speech scenes from many speakers, but this data collection is time-consuming. This paper proposes a method to solve this problem. The First Order Motion Model (FOMM) is a deep generative model that generates a video sequence by animating a source image according to a driving video. Our idea is to apply FOMM to all speech scenes in a dataset so that they appear to have been recorded from a single speaker. The proposed preprocessing thereby replaces the speaker-independent recognition task with a speaker-dependent one. We applied the proposed method to two publicly available databases, OuluVS and CUAVE, and confirmed that recognition accuracy improved on both.