Abstract

Inspired by recent developments in deep network-based semantic image segmentation, we introduce an end-to-end trainable model for face mask extraction in video sequences. Compared with the sparse face shape representation of landmark-based methods, our method produces segmentation masks of individual facial components, which better reflect their detailed shape variations. By integrating the convolutional LSTM (ConvLSTM) algorithm with fully convolutional networks (FCN), our ConvLSTM-FCN model works on a per-sequence basis and exploits the temporal correlation in video clips. In addition, we propose a novel loss function, called segmentation loss, to directly optimise the intersection over union (IoU) performance. In practice, to further increase segmentation accuracy, one primary model and two additional models were trained to focus on the face, eyes, and mouth regions, respectively. Our experiments show that the proposed method achieves a 16.99% relative improvement (from 54.50% to 63.76% mean IoU) over the baseline FCN model on the 300 Videos in the Wild (300VW) dataset.
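The paper's exact segmentation-loss formulation is not reproduced on this page, but the idea of directly optimising IoU is commonly realised as a differentiable "soft" IoU. The sketch below is a minimal illustration under that assumption, for binary masks with sigmoid network outputs; it is not the authors' implementation:

```python
import torch

def soft_iou_loss(logits, targets, eps=1e-6):
    """Differentiable IoU-based loss (illustrative sketch, not the paper's code).

    logits:  (N, H, W) raw network outputs
    targets: (N, H, W) binary ground-truth masks in {0, 1}
    """
    probs = torch.sigmoid(logits)                 # soft predictions in [0, 1]
    inter = (probs * targets).sum(dim=(1, 2))     # soft intersection per sample
    union = (probs + targets - probs * targets).sum(dim=(1, 2))  # soft union
    iou = (inter + eps) / (union + eps)           # eps avoids division by zero
    return (1.0 - iou).mean()                     # minimise 1 - IoU
```

Minimising 1 − IoU aligns the training objective with the evaluation metric, unlike pixel-wise cross-entropy, which weighs every pixel equally regardless of region size.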

Highlights

  • The sparse facial shape descriptor extracted with a traditional landmark-based face tracker usually cannot capture the full details of the facial components' shapes, which are essential to the recognition of higher-level features such as facial expressions, emotions, and identity.

  • Our experimental results show that the utilisation of temporal information can significantly improve the performance of fully convolutional networks (FCN) for face mask extraction, and the performance of the ConvLSTM-FCN model surpasses that of traditional landmark tracking models (63.76% vs. 60.09%); a rough sketch of the ConvLSTM-FCN combination follows this list.

  • It is worth mentioning that, to the best of our knowledge, there is no similar work in terms of semantic face segmentation or face mask extraction in video sequences, so we have investigated studies of video semantic segmentation instead.
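As a rough illustration of how ConvLSTM and FCN can be combined, the sketch below applies an FCN-style backbone to each frame and runs a single ConvLSTM layer over the resulting feature maps. The `fcn_backbone` module, layer sizes, and single-layer recurrence are hypothetical placeholders, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: LSTM gates computed with convolutions."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)
        self.hid_ch = hid_ch

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c

class ConvLSTMFCN(nn.Module):
    """Per-frame FCN features -> ConvLSTM over time -> per-frame masks."""
    def __init__(self, fcn_backbone, feat_ch, hid_ch, n_classes):
        super().__init__()
        self.backbone = fcn_backbone           # any FCN-style feature extractor
        self.cell = ConvLSTMCell(feat_ch, hid_ch)
        self.head = nn.Conv2d(hid_ch, n_classes, kernel_size=1)

    def forward(self, clip):                   # clip: (N, T, C, H, W)
        n, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))    # (N*T, F, H', W')
        feats = feats.view(n, t, *feats.shape[1:])
        h = feats.new_zeros(n, self.cell.hid_ch, *feats.shape[-2:])
        c = torch.zeros_like(h)
        out = []
        for step in range(t):                  # recurrence over the clip
            h, c = self.cell(feats[:, step], (h, c))
            out.append(self.head(h))
        # Masks at feature resolution; a real model would upsample to the input size.
        return torch.stack(out, dim=1)         # (N, T, n_classes, H', W')
```

The recurrence lets each frame's prediction draw on the preceding frames' spatial features, which is how the model exploits temporal correlation within a clip.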



Introduction

The sparse facial shape descriptor extracted with a traditional landmark-based face tracker usually cannot capture the full details of the facial components' shapes, which are essential to the recognition of higher-level features such as facial expressions, emotions, and identity. Inspired by recent deep network-based semantic image segmentation methods, we propose a novel approach for extracting face masks in video sequences. Different from semantic face segmentation, face mask extraction handles occlusion in a similar way to facial landmark tracking: the extracted face mask is expected to be complete regardless of occlusion, while a typical segmentation result would exclude the occluded area. To the best of our knowledge, this is the first exploration of face mask extraction in video sequences with an end-to-end trainable deep-learning model.

