Abstract

Using computer-vision and image-processing techniques, we aim to identify specific visual cues induced by facial movements made during monosyllabic speech production. The method is named ADFAC: Automatic Detection of Facial Articulatory Cues. Four facial points of interest were detected automatically to represent head, eyebrow and lip movements: the nose tip (a proxy for head movement), the medial point of the left eyebrow, and the midpoints of the upper and lower lips. The detected points were then automatically tracked in the subsequent video frames. Critical features such as the distance, velocity, and acceleration describing local facial movements with respect to each speaker's resting face were extracted from the positional profiles of each tracked point. In this work, a variant of random forest is proposed to determine which facial features are significant in classifying speech sound categories. The method takes both video and audio as input and extracts features from any video with a plain or simple background. The method is implemented in MATLAB, and the scripts are made available on GitHub for easy access.

  • Using innovative computer-vision and image-processing techniques to automatically detect and track keypoints on the face during speech production in videos, thus allowing more natural articulation than previous sensor-based approaches.

  • Measuring multi-dimensional and dynamic facial movements by extracting time-related, distance-related and kinematics-related features in speech production.

  • Adopting the novel random forest classification approach to determine and rank the significance of facial features toward accurate speech sound categorization.
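
To make the tracking and feature-extraction steps concrete, the sketch below is a minimal MATLAB illustration, not the released ADFAC scripts: it assumes the four points of interest have already been located in the first (resting) frame, tracks them with a KLT point tracker from the Computer Vision Toolbox, and derives distance, velocity and acceleration profiles relative to the resting face. The file name and pixel coordinates are placeholder assumptions.

    % Minimal sketch (not the released ADFAC scripts); requires the
    % Computer Vision and Image Processing Toolboxes.
    vr  = VideoReader('speaker01.mp4');     % hypothetical input video
    fps = vr.FrameRate;

    firstFrame = readFrame(vr);
    % Rows: nose tip, left-eyebrow medial point, upper-lip midpoint,
    % lower-lip midpoint, each as [x y] in pixels on the resting face.
    restingPts = [320 260; 280 200; 320 300; 320 330];   % example values

    tracker = vision.PointTracker('MaxBidirectionalError', 2);   % KLT tracker
    initialize(tracker, restingPts, rgb2gray(firstFrame));

    traj = restingPts;                      % grows into nPts-by-2-by-nFrames
    f = 1;
    while hasFrame(vr)
        frame = readFrame(vr);
        [pts, valid] = tracker(rgb2gray(frame));
        pts(~valid, :) = NaN;               % mark points the tracker lost
        f = f + 1;
        traj(:, :, f) = pts;
    end

    % Kinematic features per point: Euclidean distance from the resting
    % position, then velocity and acceleration by finite differences.
    for k = 1:size(restingPts, 1)
        xy   = squeeze(traj(k, :, :))';                    % nFrames-by-2
        dist = sqrt(sum((xy - restingPts(k, :)).^2, 2));   % pixels
        vel  = [0; diff(dist)] * fps;                      % pixels/s
        acc  = [0; diff(vel)]  * fps;                      % pixels/s^2
        fprintf('Point %d: peak displacement %.1f px\n', k, max(dist));
    end

Finite differences scaled by the frame rate give per-frame velocity and acceleration in pixel units; the released scripts may apply different smoothing, units or point-detection logic.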

Highlights

  • Different research articles have reported different methods to acquire and analyze visual speech articulatory movement data.

  • In addition to tracking and feature extraction, we propose a novel analysis method based on random forest for the classification task.

  • In the first step, we identify which features differentiate each tone from the other tones by training a classifier (a sketch follows this list).
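
A minimal sketch of that classification step, assuming a feature matrix X (one row per spoken token, one column per extracted facial feature) and a tone-label vector toneLabels already in the workspace; it uses TreeBagger and out-of-bag permuted-error importance from MATLAB's Statistics and Machine Learning Toolbox rather than the authors' released scripts, and the tree count of 500 is an arbitrary choice.

    % Minimal sketch: rank facial features by importance with a random
    % forest. X and toneLabels are assumed to exist in the workspace.
    rng(1);                                              % reproducibility
    forest = TreeBagger(500, X, toneLabels, ...
        'Method', 'classification', ...
        'OOBPredictorImportance', 'on');

    % Out-of-bag permuted error yields one importance score per feature;
    % larger scores indicate features that matter more for separating tones.
    imp = forest.OOBPermutedPredictorDeltaError;
    [~, ranked] = sort(imp, 'descend');
    disp(ranked(1:min(5, numel(ranked))));               % top-ranked features

    % One-vs-rest variant: train a forest per tone to see which features
    % differentiate that tone from all the others.
    tones = categories(categorical(toneLabels));
    for t = 1:numel(tones)
        yBin = double(strcmp(string(toneLabels), tones{t}));   % 1 = this tone
        rf = TreeBagger(500, X, yBin, 'Method', 'classification', ...
            'OOBPredictorImportance', 'on');
        [~, best] = max(rf.OOBPermutedPredictorDeltaError);
        fprintf('Tone %s: most important feature index = %d\n', tones{t}, best);
    end

Out-of-bag importance is one common way to rank predictors in a random forest; the authors' implementation may rank features differently.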



Introduction

Background
Different research articles have reported different methods to acquire and analyze visual speech articulatory movement data. The study of tone-vowel co-production by Shaw et al. [17] used electromagnetic articulography (EMA). This method involves placing sensor coils on various parts of a speaker's face and mouth, including the lips, tongue, and jaw. OPTOTRAK has been used to capture eyebrow and jaw movements for sentence focus, with measurements of the displacement and peak velocity of these movements [10]. Another similar sensor-based method using a motion capture system involves attaching retro-reflectors (Qualisys AB) to the speaker's face for recording, allowing analysis of lip, eyebrow, and head displacement magnitude and movement velocity [16]. These sensor-based systems have limitations of their own: although sensor-based methods are more precise in capturing motion than annotation-based methods, only the limited regions where the sensors are placed can be analyzed.

