Abstract

Representation transfer learning has been widely used across a range of machine learning tasks. One notable approach in the speech literature is the use of Convolutional Neural Networks, pre-trained for image classification, to extract features from spectrograms of speech signals. Interestingly, despite the strong performance of such approaches, there has been minimal research into the suitability of speech-specific networks for feature extraction. In this regard, a novel feature representation learning framework is presented herein. The approach comprises the use of Automatic Speech Recognition (ASR) deep neural networks as feature extractors, the fusion of several extracted feature representations using Compact Bilinear Pooling (CBP), and, finally, inference via a specially optimised Recurrent Neural Network (RNN) classifier. To determine the usefulness of these feature representations, they are comprehensively tested on two representative speech-health classification tasks, namely recognition of the food type being eaten and detection of speaker intoxication. Key results indicate the promise of the extracted features, demonstrating results comparable to other state-of-the-art approaches in the literature.
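To make the fusion step concrete, the sketch below illustrates Compact Bilinear Pooling via the Tensor Sketch approximation (Gao et al., 2016), applied to two hypothetical feature vectors such as might come from different ASR-network layers. The dimensions, projection size, and feature sources are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of Compact Bilinear Pooling (CBP) using count sketches and
# FFT-based circular convolution. Feature vectors and sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def make_sketch_params(input_dim, output_dim, rng):
    """Random hash buckets and signs for one count-sketch projection."""
    h = rng.integers(0, output_dim, size=input_dim)   # bucket index per input dim
    s = rng.choice([-1.0, 1.0], size=input_dim)       # random sign per input dim
    return h, s

def count_sketch(x, h, s, output_dim):
    """Project x into output_dim buckets with signed accumulation."""
    y = np.zeros(output_dim)
    np.add.at(y, h, s * x)
    return y

def compact_bilinear_pool(x1, x2, params1, params2, output_dim):
    """Approximate outer-product (bilinear) pooling of x1 and x2 as the
    circular convolution of their count sketches, computed via FFT."""
    p1 = np.fft.rfft(count_sketch(x1, *params1, output_dim))
    p2 = np.fft.rfft(count_sketch(x2, *params2, output_dim))
    return np.fft.irfft(p1 * p2, n=output_dim)

# Hypothetical feature vectors extracted from two ASR network layers.
d1, d2, D = 512, 256, 1024
feat_a, feat_b = rng.standard_normal(d1), rng.standard_normal(d2)
params_a = make_sketch_params(d1, D, rng)
params_b = make_sketch_params(d2, D, rng)
fused = compact_bilinear_pool(feat_a, feat_b, params_a, params_b, D)
print(fused.shape)  # (1024,) -- fused representation fed to the RNN classifier
```

The fused vector replaces the explicit (and much larger) outer product of the two feature vectors, keeping the input to the downstream RNN classifier at a fixed, manageable dimensionality.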
