Abstract

We aim to identify visual cues resulting from facial movements made during Mandarin tone production and to examine how they are associated with each of the four tones. We use signal processing and computer vision techniques to analyze audio-video recordings of 21 native Mandarin speakers uttering the vowel /ɜ/ with each tone. Four facial interest points were automatically detected and tracked in the video frames: the medial point of the left eyebrow, the nose tip (a proxy for head movement), and the midpoints of the upper and lower lips. Spatiotemporal features were extracted from the positional profile of each tracked point, including the distance, velocity, and acceleration of local facial movements with respect to each speaker's resting face. Analysis of variance and a feature importance analysis based on a random decision forest were performed to examine the significance of each feature for representing each tone and how well these features, individually and collectively, characterize each tone. Preliminary results suggest alignments between articulatory movements and pitch trajectories: downward or upward head and eyebrow movements follow the dipping and rising tone trajectories, lip closing accelerates toward the end of falling tone production, and the level tone shows minimal movement.
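The abstract describes the feature pipeline only at a high level. Below is a minimal illustrative sketch, not the authors' implementation, of how distance, velocity, and acceleration features could be derived from a tracked point's positional profile and then screened with random-forest feature importance. The function names, the choice of per-utterance summary statistics, and parameters such as fps and n_estimators are assumptions for illustration.

```python
# Illustrative sketch only (not the paper's code): derive per-utterance
# distance/velocity/acceleration features from one tracked facial point,
# then rank features with a random-forest importance analysis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def spatiotemporal_features(positions, rest_position, fps):
    """positions: (n_frames, 2) pixel coordinates of one tracked point.
    rest_position: (2,) coordinates of that point on the speaker's resting face.
    fps: video frame rate, converting per-frame differences to per-second rates.
    """
    dt = 1.0 / fps
    # Distance of the point from its resting-face location in each frame.
    distance = np.linalg.norm(positions - rest_position, axis=1)
    # First and second temporal derivatives of the positional profile.
    velocity = np.gradient(distance, dt)
    acceleration = np.gradient(velocity, dt)
    # Collapse each profile into per-utterance summary statistics
    # (this particular choice of statistics is an assumption).
    return np.array([
        distance.max(), distance.mean(),
        np.abs(velocity).max(), np.abs(velocity).mean(),
        np.abs(acceleration).max(), np.abs(acceleration).mean(),
    ])

def tone_feature_importance(X, y):
    """X: one row of concatenated features per utterance; y: tone labels (1-4)."""
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(X, y)
    # Impurity-based importance: how much each feature contributes
    # to separating the four tones.
    return forest.feature_importances_
```

A per-feature analysis of variance across the four tone groups (e.g., scipy.stats.f_oneway) would complement the importance ranking, paralleling the two analyses the abstract names.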
