Comparison of Artificial Neural Network Types for Infant Vocalization Classification

Franz Anders, Mirco Fuchs, Mario Hlawitschka

doi:10.1109/taslp.2020.3037414

Abstract

In this study, we compared various neural network types for the task of automatic infant vocalization classification, i.e., convolutional, recurrent, and fully-connected networks as well as combinations thereof. The goal was first to determine the optimal configuration for each network type and then to identify the type with the highest overall performance. This investigation helps to apply neural networks more effectively to infant vocalization classification tasks, which typically offer little training data. To this end, we defined a unified neural network architecture scheme for audio classification from which we derived various network types. For each type, we performed a semi-random hyperparameter search that employed regression trees both to focus the search space and to derive insights into the most influential parameters. Finally, we compared the test performances of the best-performing configurations in a contest-like setup. Our key findings are: (1) Networks with convolutional stages reached the highest performance, regardless of whether they were combined with fully-connected or recurrent layers. (2) The most influential architectural hyperparameters for all types were the integration operations for reducing tensor dimensionality between network stages. The best-performing configurations reached test performances of 75% unweighted average recall, surpassing previously published benchmarks.
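The idea of using regression trees to find the most influential hyperparameters can be illustrated with a minimal sketch. It is not the authors' implementation: it reduces the tree to a single split per parameter (a depth-one tree) and ranks parameters by how much that split lowers the squared error of the observed scores. The trial log, the parameter names `integration` and `n_layers`, and the scores are hypothetical values chosen for illustration.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split_gain(trials, param):
    """Largest SSE reduction from a binary split 'param == value' vs. the rest.

    This is the criterion a regression tree would use to pick its root split.
    """
    parent = sse([t["score"] for t in trials])
    best = 0.0
    for value in {t[param] for t in trials}:
        left = [t["score"] for t in trials if t[param] == value]
        right = [t["score"] for t in trials if t[param] != value]
        best = max(best, parent - sse(left) - sse(right))
    return best


def rank_parameters(trials, params):
    """Order hyperparameters from most to least influential on the score."""
    gains = {p: best_split_gain(trials, p) for p in params}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical trial log: each entry is one evaluated configuration.
trials = [
    {"integration": "max", "n_layers": 2, "score": 0.70},
    {"integration": "max", "n_layers": 4, "score": 0.72},
    {"integration": "flatten", "n_layers": 2, "score": 0.60},
    {"integration": "flatten", "n_layers": 4, "score": 0.61},
]

ranking = rank_parameters(trials, ["integration", "n_layers"])
print(ranking)
```

In this toy log, splitting on the integration operation explains almost all of the score variance, so it is ranked first. A search procedure could then sample more configurations from the better-scoring side of that split, which is one plausible way to "focus the search space".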

Highlights

Classification of infant vocalizations into qualitative categories is among the most relevant tasks in automatic infant vocalization assessment
This research substantially expands the scope of our previous work [29], in which we investigated ordinary VGG-like convolutional neural networks (CNNs) for infant vocalization classification, albeit on a different dataset and with different target classes. In the present study, we investigated further network types, covered a larger and more diverse search space, and used the competition’s dataset so that our results can be compared to competing approaches
The main result of the network type comparison is that networks with convolutional stages (C-NNs, C-R-NNs and C-FC-NNs) outperformed recurrent networks (R-NNs and R-FC-NNs) by at least 2% in performance
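The abstract reports performance as unweighted average recall (UAR), which is the mean of the per-class recalls and therefore gives every class the same weight regardless of how many samples it has. The following is a minimal sketch of that metric; the class labels and predictions are hypothetical and are not taken from the study's dataset.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, computed over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced example: four "cry" samples, two "laugh" samples.
y_true = ["cry", "cry", "cry", "cry", "laugh", "laugh"]
y_pred = ["cry", "cry", "cry", "laugh", "laugh", "cry"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)
```

Here the recall is 3/4 for "cry" and 1/2 for "laugh", so the UAR is 0.625 while plain accuracy is 4/6. The gap shows why UAR is a common choice when class sizes are unequal.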

Summary

Introduction

Classification of infant vocalizations into qualitative categories is among the most relevant tasks in automatic infant vocalization assessment. Such systems are primarily applied in medical fields, e.g., pain assessment [1] or early detection of impairments and disorders [2], [3]. The degree of expertise required by humans to discriminate infant vocalizations is task-specific: On the one hand, some.

Manuscript received June 12, 2020; revised September 25, 2020; accepted October 24, 2020. Date of publication November 11, 2020; date of current version December 7, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Prof.

Methods

Results

Discussion

Conclusion