Abstract

Cued Speech (CS) augments lip reading with hand coding. Because lip and hand movements are asynchronous, directly fusing their features may degrade recognition performance, which makes multi-modal fusion a challenging problem in automatic CS recognition. In our previous work, we built a hand preceding model for hand positions (vowels) by investigating the temporal organization of hand movements in French CS. In this work, we determine a suitable hand preceding time for consonants by analyzing the temporal movements of hand shapes in French CS. Based on these two results, we propose an efficient resynchronization procedure for fusing multi-stream features in CS. We apply this procedure to continuous CS phoneme recognition based on a multi-stream CNN-HMM architecture. The results show that the procedure improves phoneme recognition correctness by about 4.6% compared with the state of the art, which does not account for the asynchrony between modalities.
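
The resynchronization idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each stream is a frame-level feature matrix, that the hand preceding time reduces to a fixed number of frames per stream, and that fusion is plain frame-wise concatenation. The function names and the lead values used below are hypothetical and chosen only for illustration.

```python
import numpy as np


def delay_stream(features, lead_frames):
    """Delay a (T, D) feature stream by lead_frames frames.

    Because the hand precedes the lips, the hand frame observed at time
    t - lead_frames is paired with the lip frame at time t. The first
    frame is repeated as padding so the output keeps length T.
    """
    if lead_frames < 0:
        raise ValueError("lead_frames must be non-negative")
    total = features.shape[0]
    lead = min(lead_frames, total)
    padding = np.repeat(features[:1], lead, axis=0)
    return np.concatenate([padding, features[:total - lead]], axis=0)


def resynchronize(lips, hand_position, hand_shape, position_lead, shape_lead):
    """Align both hand streams to the lip stream, then concatenate per frame.

    position_lead and shape_lead are hypothetical per-stream leads in frames;
    the abstract does not state their values.
    """
    aligned_position = delay_stream(hand_position, position_lead)
    aligned_shape = delay_stream(hand_shape, shape_lead)
    return np.concatenate([lips, aligned_position, aligned_shape], axis=1)


# Toy example: 5 frames, one feature per stream, made-up lead values.
lips = np.arange(5, dtype=float).reshape(5, 1)
hand_position = 10.0 + np.arange(5, dtype=float).reshape(5, 1)
hand_shape = 20.0 + np.arange(5, dtype=float).reshape(5, 1)
fused = resynchronize(lips, hand_position, hand_shape, position_lead=2, shape_lead=1)
```

In this toy example each row of `fused` pairs a lip frame with the hand position frame from two frames earlier and the hand shape frame from one frame earlier, so the hand streams are shifted before fusion rather than fused at the same time index.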
