Abstract

Faces and gestures provide a natural way for humans to communicate with each other, manipulate objects, and perform everyday activities. Accurate and robust modeling and analysis of human faces and gestures have drawn long-standing research attention over the past decades. Computer vision researchers have explored vision-based methods to learn high-level information about human faces and bodies from images and videos. Recently, advances in deep convolutional neural networks (CNNs) and the collection of high-quality large-scale datasets have jointly enabled accurate and robust recognition and prediction of human gestures and behaviors. Meanwhile, the emergence of deep generative adversarial networks (GANs) has spurred a surge of interest in synthesizing realistic samples and has attracted tremendous attention in both academia and industry. This dissertation focuses on the vision-based analysis of human faces and gestures and covers four topics: synthesizing faces, manipulating face images, estimating human poses on edge devices, and recognizing sign language.

First, we decouple the face image synthesis task into three independent dimensions and propose a novel Spatially Constrained Generative Adversarial Network (SCGAN) to model it. We then extend the spatial constraints to the image translation task for face manipulation by proposing a Segmentation Guided Generative Adversarial Network (SGGAN) and its improved version, the Geometrically Editable Generative Adversarial Network (GEGAN).

Second, we tackle the challenging problem of modeling human gestures on edge devices with 2D human pose estimation models. We propose a model that inherently uses the pose estimation results of previous frames to refine the current estimate, which significantly mitigates jitter in the estimated skeletons. We reduce the model complexity by proposing a lightweight pose module so that the model runs in real time on edge devices with comparable accuracy.

Last, sign language recognition (SLR) is an essential yet challenging task that bridges the communication gap between sign language users and others. We propose a Skeleton Aware Multi-modal SLR (SAM-SLR) framework that leverages skeleton-based graph representations of full-body poses and uses a multi-modal ensemble to achieve a higher recognition rate.

In this dissertation, I introduce the problem background and settings, present the technical details of our novel approaches, and conduct extensive experiments evaluating them on popular benchmark datasets against state-of-the-art methods. --Author's abstract
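To make the temporal-refinement idea concrete, below is a minimal PyTorch-style sketch of feeding the previous frame's predicted heatmaps back into the current frame's estimate. The module name, layer sizes, and fusion scheme are illustrative assumptions, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalPoseRefiner(nn.Module):
    """Illustrative sketch (not the dissertation's exact model): refine the
    current frame's joint heatmaps by fusing backbone features with the
    heatmaps predicted for the previous frame, so consecutive predictions
    stay consistent and skeleton jitter is damped."""

    def __init__(self, num_joints: int = 17, feat_channels: int = 32):
        super().__init__()
        # A small fusion head: the previous heatmaps act as a spatial prior.
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_channels + num_joints, feat_channels,
                      kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, num_joints, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor,
                prev_heatmaps: torch.Tensor) -> torch.Tensor:
        # feats:         (B, C, H, W) current-frame backbone features
        # prev_heatmaps: (B, J, H, W) heatmaps estimated for the previous frame
        return self.fuse(torch.cat([feats, prev_heatmaps], dim=1))

# Hypothetical usage over a video: carry heatmaps from frame to frame.
refiner = TemporalPoseRefiner()
feats = torch.randn(1, 32, 64, 48)   # from some lightweight backbone
prev = torch.zeros(1, 17, 64, 48)    # no prior available for the first frame
heatmaps = refiner(feats, prev)      # (1, 17, 64, 48)
```

Because the refiner only adds a few convolutions on top of a lightweight backbone, this style of recurrence keeps the per-frame cost low, which is the kind of trade-off an edge-device model needs.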
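Similarly, the multi-modal ensemble in an SAM-SLR-style framework can be sketched as weighted score-level (late) fusion of per-modality predictions. The modality names, class count, and weights below are assumptions for illustration, not SAM-SLR's tuned values.

```python
import numpy as np

def fuse_modalities(scores: dict, weights: dict) -> np.ndarray:
    """Weighted late fusion: sum per-modality class scores, then take the
    argmax. Modality names and weights here are purely illustrative."""
    fused = sum(weights[m] * scores[m] for m in scores)
    return fused.argmax(axis=-1)

# Hypothetical per-modality class scores: a batch of 2 clips, 100 classes.
rng = np.random.default_rng(0)
scores = {m: rng.random((2, 100))
          for m in ("skeleton", "rgb", "depth", "flow")}
weights = {"skeleton": 1.0, "rgb": 0.9, "depth": 0.4, "flow": 0.4}
print(fuse_modalities(scores, weights))  # predicted class index per clip
```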
