Abstract

Lipreading feature extraction is essentially feature extraction from continuous video frame sequences. To obtain more reasonable visual spatial–temporal characteristics, a lipreading model based on a two-way convolutional neural network with feature fusion is proposed. Unlike other deep-learning-based lipreading methods, the rank pooling method transforms a lip video into a standard RGB image that can be fed directly into the convolutional neural network, which effectively reduces the input dimensionality. In addition, to compensate for the loss of spatial information, the apparent-shape and depth features are fused, and a joint cost function is then used to guide the learning of the network model toward more discriminative features. The method was evaluated on the public GRID and OuluVS2 databases; the results show that the accuracy of the proposed method exceeds 93%, which validates its effectiveness.
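The abstract does not include code, but the rank pooling step it describes can be illustrated with a short sketch. The following is a minimal NumPy implementation of approximate rank pooling (the closed-form "dynamic image" coefficients of Bilen et al., 2016), assuming the lip video is given as a (T, H, W, C) array of frames; the function name and the min–max rescaling to uint8 are illustrative choices, not taken from the paper.

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a frame sequence of shape (T, H, W, C) into one dynamic image.

    Uses the closed-form approximate rank pooling coefficients
    alpha_t = 2(T - t + 1) - (T + 1) * (H_T - H_{t-1}),
    where H_t is the t-th harmonic number (H_0 = 0).
    """
    T = frames.shape[0]
    # harmonics[t] = H_t for t = 0..T
    harmonics = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    alphas = 2 * (T - t + 1) - (T + 1) * (harmonics[T] - harmonics[t - 1])
    # Weighted sum over the time axis yields a single (H, W, C) image.
    dyn = np.tensordot(alphas, frames.astype(np.float64), axes=(0, 0))
    # Min-max rescale to 0-255 so the result can be treated as a standard
    # RGB image (an illustrative choice, not specified by the paper).
    dyn = (dyn - dyn.min()) / (dyn.max() - dyn.min() + 1e-8) * 255.0
    return dyn.astype(np.uint8)

# Example: a 25-frame lip clip becomes one RGB image suitable for a 2-D CNN.
video = np.random.randint(0, 256, size=(25, 64, 64, 3), dtype=np.uint8)
dynamic_image = approximate_rank_pooling(video)  # shape (64, 64, 3)
```

Because the entire clip is summarized as one standard image, an ordinary 2-D convolutional network can consume it directly, which is how the input-dimensionality reduction mentioned in the abstract is achieved.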
