Deep End-to-End Representation Learning for Food Type Recognition from Speech

Benjamin Sertolli, Nicholas Cummins, Bjoern W Schuller, Abdulkadir Sengur

doi:10.1145/3242969.3243683


Open Access



Publication Date: Oct 2, 2018
Citations: 1	License type: public-domain

Affiliation: University of Augsburg

Abstract

Using a Convolutional Neural Network (CNN) pre-trained for one task as a feature extractor for another is standard practice in many image classification paradigms. However, to date comparatively few works have explored this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. The key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database for a seven-class food-type classification task.
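
The reported metric, unweighted average recall (UAR), is the mean of the per-class recalls, so each class contributes equally regardless of how many samples it has. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `unweighted_average_recall` and the toy labels are assumptions made for illustration and are not taken from the iHEARu-EAT database.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (hypothetical helper, for illustration)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall of each class that occurs in y_true, averaged with equal weight.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example with made-up labels (not real data).
y_true = ["class_a"] * 4 + ["class_b"]
y_pred = ["class_a", "class_a", "class_a", "class_b", "class_a"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the recalls are 3/4 for `class_a` and 0/1 for `class_b`, so the UAR is 0.375 while plain accuracy is 0.6. The gap shows why UAR is a stricter measure than accuracy when classes are imbalanced.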

Full Text