Abstract

Predicting and analyzing multimodal dialog user experience (UX) metrics, such as overall call experience, caller engagement, and latency, in an ongoing manner is important for evaluating multimodal dialog systems. We investigate automated prediction of multiple such metrics collected from crowdsourced interactions with an open-source, cloud-based multimodal dialog system in the educational domain. We extract features from both the audio and video signals and examine the efficacy of multiple machine learning algorithms in predicting these performance metrics. The best performing audio features consist of multiple low-level audio descriptors—intensity, loudness, cepstra, pitch, and so on—and their functionals, extracted using the OpenSMILE toolkit, while the video features are bags of visual words built on 3D Scale-Invariant Feature Transform descriptors. We find that our proposed methods outperform the majority vote classification baseline in predicting various UX metrics rated by both the user and experts. Our results suggest that such automated prediction of performance metrics can not only inform the qualitative and quantitative analysis of dialogs but also be potentially incorporated into dialog management routines for positively impacting UX and other metrics during the course of the interaction.
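
To make the prediction setup concrete, below is a minimal sketch in Python of the kind of pipeline described: functional audio features (here extracted with the opensmile Python wrapper as a stand-in for whatever OpenSMILE configuration the study actually used) fed to a standard classifier and scored against a majority-vote baseline. The feature set, classifier choice, file names, and labels are illustrative assumptions, not the paper's exact configuration.

import opensmile                                    # assumed Python wrapper for the OpenSMILE toolkit
import numpy as np
from sklearn.dummy import DummyClassifier           # majority-vote baseline
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Low-level descriptors and their functionals, one feature vector per recorded interaction.
# ComParE_2016 is an assumed feature set; the paper may use a different OpenSMILE configuration.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Hypothetical inputs: one audio file per crowdsourced call, and a UX rating per call
# (e.g., good vs. poor overall call experience) collected from users or experts.
wav_paths = ["call_001.wav", "call_002.wav", "call_003.wav"]
y = np.array([1, 0, 1])

# Each process_file call returns a single row of functionals; stack rows into a feature matrix.
X = np.vstack([smile.process_file(p).to_numpy() for p in wav_paths])

# Compare a standard classifier against the majority-vote baseline under cross-validation
# (cv shown here only schematically; a real corpus would allow more folds).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
baseline = DummyClassifier(strategy="most_frequent")

print("classifier accuracy:", cross_val_score(clf, X, y, cv=3).mean())
print("majority-vote baseline:", cross_val_score(baseline, X, y, cv=3).mean())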
