Visual Speech Recognition in Natural Scenes Based on Spatial Transformer Networks

Jin Yu, Shilin Wang

doi:10.1109/asid50160.2020.9271783

Abstract

In this paper, we improve the performance of visual speech recognition in natural scenes using spatial transformer networks. Visual speech recognition can be applied in authentication systems as a form of liveness detection, which guards against replay attacks. Identity authentication based on visual speech recognition may be conducted anywhere on portable electronic devices. However, natural scenes exhibit many variations, including diverse speaker poses, varying distances from the camera, and occasional quivering of the lips, which make recognition considerably harder and degrade the performance of the authentication system. To address these challenges, we introduce spatial transformer networks (STN), which help handle such variations, especially in complex natural scenes. Considering the characteristics of lip features, we propose a new transformation network that fuses temporal and spatial information to generate the transformation parameters. The network can be readily inserted into existing visual speech recognition approaches and trained end to end. By taking temporal dependencies into account, it performs a better transformation to normalize the lip image sequences, thereby reducing the difficulty of visual speech recognition in natural scenes and enhancing the security of the identity authentication system. Experimental results demonstrate that our approach achieves a lower word error rate, particularly in natural scenes.
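The normalization step that a spatial transformer applies to lip images can be illustrated with a minimal sketch. This is not the network proposed in the paper: the affine parameters `theta` are supplied by hand rather than predicted from temporal and spatial features, the helper names `warp_frame` and `warp_sequence` are hypothetical, and applying a single `theta` to every frame of a sequence is an assumption made only to keep the example short. The sketch shows the generic resampling operation (affine grid plus bilinear interpolation) in plain Python.

```python
import math


def warp_frame(frame, theta):
    """Resample one 2-D frame (list of rows) under a 2x3 affine matrix.

    Coordinates are normalised to [-1, 1]; pixels that map outside the
    source frame contribute zero (zero padding). Illustrative only.
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Normalised coordinates of the output (target) pixel.
            xt = -1.0 + 2.0 * j / (w - 1) if w > 1 else 0.0
            yt = -1.0 + 2.0 * i / (h - 1) if h > 1 else 0.0
            # Affine map from target coordinates to source coordinates.
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            # Back to (fractional) pixel indices in the source frame.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            dx, dy = px - x0, py - y0
            # Bilinear interpolation over the four neighbouring pixels.
            value = 0.0
            for yy, wy in ((y0, 1.0 - dy), (y0 + 1, dy)):
                for xx, wx in ((x0, 1.0 - dx), (x0 + 1, dx)):
                    if 0 <= yy < h and 0 <= xx < w:
                        value += wy * wx * frame[yy][xx]
            out[i][j] = value
    return out


def warp_sequence(frames, theta):
    """Apply the same affine transform to every frame (an assumption)."""
    return [warp_frame(frame, theta) for frame in frames]


# Hand-written parameters: identity, and a horizontal mirror.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
MIRROR_X = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# A toy "lip sequence" of two 3x3 frames.
frames = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]],
]
same = warp_sequence(frames, IDENTITY)
mirrored = warp_sequence(frames, MIRROR_X)
```

In a trainable setting, `theta` would be produced by a learned network and the resampling would be done by a differentiable operator so that gradients flow back to that network; the pure-Python loop here only makes the geometry explicit.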

Full Text