Inner lips feature extraction based on CLNF with hybrid dynamic template for Cued Speech

Li Liu, Denis Beautemps, Gang Feng

doi:10.1186/s13640-017-0233-y

Abstract

In previous French Cued Speech (CS) studies, a widely used method has been to paint the speaker’s lips blue to make lips feature extraction easier. To remove this artifice, this paper presents a novel automatic method for extracting the inner lips contour of CS speakers. The method is based on a recent facial contour extraction model from computer vision, the Constrained Local Neural Field (CLNF), which provides eight characteristic landmarks describing the inner lips contour. However, when applied directly to our CS data, CLNF fails in about 41.4% of cases. We therefore propose two methods to correct the B parameter (aperture of the inner lips) and the A parameter (width of the inner lips), respectively. To correct the B parameter, we propose a hybrid dynamic correlation template method (HD-CTM) that uses the first derivative of the smoothed luminance variation. HD-CTM is first applied to detect the outer lower lip position; the inner lower lip position is then obtained by subtracting the validated lower lips thickness (VLLT). To correct the A parameter, we explore a periodical spline interpolation with a geometrical deformation of six CLNF inner lips landmarks. Combined with an automatic round lips detector, this method efficiently corrects the A parameter for round lips (the third vowel viseme, made up of French vowels with a small opening). HD-CTM is evaluated on 4800 images of three French speakers. It corrects about 95% of the CLNF errors on the B parameter and achieves a total RMSE of one pixel (i.e., 0.05 cm on average). The periodical spline interpolation method is tested on 927 round lips images. The total error of CLNF is reduced significantly, and the result is comparable to the state of the art. Moreover, the third viseme is properly distributed in the (A, B) parameter plane after applying this method.
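To make the quantities in the abstract concrete, the following minimal sketch shows one plausible way to read the A (width) and B (aperture) parameters off a set of inner lips landmarks, and how an RMSE in pixels could be computed and converted to centimetres. The bounding-box definition of A and B, the function names, and the sample landmark coordinates are all assumptions made for illustration; the paper may define the parameters differently. The 0.05 cm per pixel factor is the average figure quoted in the abstract.

```python
from math import sqrt

# Average scale quoted in the abstract: one pixel is about 0.05 cm.
PIXEL_TO_CM = 0.05


def inner_lip_parameters(landmarks):
    """Return (A, B) in pixels from a list of (x, y) inner lips landmarks.

    Assumption: A is the horizontal extent and B the vertical extent
    of the landmark set (a simple bounding-box reading).
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    return max(xs) - min(xs), max(ys) - min(ys)


def rmse(estimates, references):
    """Root mean square error between two equally long sequences."""
    if not estimates or len(estimates) != len(references):
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical eight landmarks around an inner lips contour (pixels).
landmarks = [(100, 50), (110, 45), (120, 44), (130, 45),
             (140, 50), (130, 56), (120, 58), (110, 56)]

A, B = inner_lip_parameters(landmarks)
A_cm, B_cm = A * PIXEL_TO_CM, B * PIXEL_TO_CM
```

Calling `rmse` on a list of estimated B values and a list of manually annotated B values would give an error in pixels, which `PIXEL_TO_CM` turns into an approximate physical size.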

Highlights

Lips detection is an active research topic since lips hold significant information about speech production, and it plays an important role in speech recognition based on lips visual features.
In 2013, Baltrusaitis et al. [10] proposed the Constrained Local Neural Field (CLNF), which is robust for facial landmark detection in the general case.
Section 6.1, "Parameter B correction based on hybrid dynamic correlation template method (HD-CTM) and back-subtraction of validated lower lips thickness (VLLT)": the proposed method is based on the luminance variation along the middle CLNF landmarks of the lips.
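The idea of locating the lower lip from the luminance variation can be illustrated with a simplified sketch. This is not HD-CTM itself, which relies on a dynamic correlation template; the code below only smooths a one-dimensional luminance profile, takes its first derivative, picks the strongest dark-to-bright transition as the outer lower lip boundary, and subtracts a lip thickness to reach the inner lower lip. The function names, the moving-average smoother, the assumption that luminance rises from lip to skin, the sample profile, and the thickness value are all hypothetical.

```python
def moving_average(values, window=3):
    """Smooth a sequence with a centred moving average (shrunk at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def first_derivative(values):
    """Forward difference: element i is values[i + 1] - values[i]."""
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]


def outer_lower_lip_row(luminance, window=3):
    """Row index of the strongest positive luminance jump.

    Assumption: going down the image, the lip is darker than the skin
    below it, so the outer lower lip boundary is a dark-to-bright edge.
    """
    derivative = first_derivative(moving_average(luminance, window))
    return max(range(len(derivative)), key=lambda i: derivative[i])


def inner_lower_lip_row(luminance, lip_thickness, window=3):
    """Inner lower lip row: outer boundary minus the lip thickness (rows)."""
    return outer_lower_lip_row(luminance, window) - lip_thickness


# Hypothetical profile sampled top to bottom along the mouth midline:
# dark mouth interior, then lower lip, then brighter skin.
profile = [20, 20, 20, 20, 80, 80, 80, 80, 85, 110, 160, 160, 160, 160]
HYPOTHETICAL_VLLT = 5  # validated lower lips thickness, in rows (assumed)

outer_row = outer_lower_lip_row(profile)
inner_row = inner_lower_lip_row(profile, HYPOTHETICAL_VLLT)
```

On this sample profile the outer boundary falls at row 8 and the inner lower lip at row 3, which is where the dark mouth interior ends. Using a fixed thickness rather than searching for the inner edge directly reflects the motivation stated in the text: the outer edge is located first, and the inner position is derived from it.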

Summary

Introduction

Lips detection is an active research topic since lips (especially inner lips) hold significant information about speech production, and it plays an important role in speech recognition based on lips visual features. In 1967, Cornett [1] developed Cued Speech (CS), a complement to lipreading that enhances speech perception from visual input including the lips and the hand. This system was adapted from American English to French in 1977. In French CS, which is named Langage Parlé Complété (LPC) [2], five hand positions are used to encode the vowels and eight hand configurations to encode the consonants. It is often used by deaf people or [...] Several approaches to extracting the lips contour in audiovisual speech processing have been investigated in the literature.

Objectives

Methods

Conclusion