Effective multiple person recognition in random video sequences using a convolutional neural network

Niraimathi Puhalanthi, Daw-Tung Lin

doi:10.1007/s11042-019-7323-z

Abstract

Effective and efficient face recognition through pervasive networks of surveillance cameras is one of the most challenging objectives of advanced computer vision. This study developed a real-time person recognition system (PRS) for the effective identification of multiple people in video sequences. We focused on identifying approximately 9000 celebrities by intelligent preprocessing, training, and deployment of a deep-learning convolutional neural network (CNN). The proposed PRS method comprises three major steps. In the first step, the multiple faces present in a given frame, as well as their associated landmarks, are detected. This detection must be precise because its accuracy dictates the accuracy of the complete PRS. In the second step, the extracted facial regions of interest are aligned using affine warping, based on their respective identified landmark positions. The alignment process is meant to ensure correct identification of a person, because a wide range of faces entails intrinsic interclass similarities. Finally, in the third step, a VGG-19 CNN is trained to classify the aligned facial images for person recognition. In the training phase of the PRS, we utilized images from the CASIA WebFace database, which contains nearly 9000 classes, and aligned them using their respective facial landmarks. Subsequently, we used the aligned images to train a VGG-19 CNN classifier. For validation, the trained classifier was tested with the standard Labeled Faces in the Wild (LFW) database by extracting features for the LFW images using the trained VGG. Specifically, the VGG-extracted LFW features were used to train support vector machine classifiers, and the resulting classification accuracy of approximately 96% was very close to the existing benchmark for the LFW database.
During the testing phase, every other frame of the input video was extracted, and the identified faces (post-alignment) were fed into the trained VGG to recognize the people in a given frame. When tested on random samples of video images, the proposed PRS offered robust recognition performance for most of the facial regions that had reasonable facial orientations and sizes. Furthermore, the average recognition time per person was approximately 370 milliseconds. The proposed deep learning-based PRS is the first of its kind to exhibit real-time performance for person recognition with significant accuracy, without involving any prior knowledge of the people involved in a video.
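As an illustration of the alignment step and the alternate-frame sampling described above, the following is a minimal sketch, not the authors' implementation. It assumes that alignment is done by fitting a least-squares 2x3 affine matrix that maps detected landmarks onto a fixed template. The function names (`estimate_affine`, `apply_affine`, `alternate_frames`) and the five-point template are hypothetical and do not come from the paper.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix that maps src landmarks onto dst landmarks."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous source coordinates: each row is [x, y, 1].
    A = np.hstack([src, np.ones((src.shape[0], 1))])
    # Solve A @ X ~= dst for the 3x2 matrix X, then transpose to 2x3.
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return X.T

def apply_affine(M, pts):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]

def alternate_frames(frames):
    """Keep every second frame, starting with the first."""
    return frames[::2]

# Hypothetical template: eyes, nose tip, and mouth corners in a 100x100 crop.
template = np.array([[30, 40], [70, 40], [50, 60], [35, 80], [65, 80]], dtype=float)

# Simulate detected landmarks by distorting the template with a known affine map.
M_true = np.array([[0.9, -0.2, 5.0], [0.2, 0.9, -3.0]])
detected = apply_affine(M_true, template)

# Fit the matrix that warps the detected landmarks back onto the template.
M = estimate_affine(detected, template)
aligned = apply_affine(M, detected)
```

In a full pipeline, a matrix of this shape would typically be handed to an image-warping routine (for example, OpenCV's `warpAffine` accepts a 2x3 matrix) to produce the aligned face crop that is then passed to the classifier.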

Full Text