Abstract

This paper focuses on the automatic recognition of a person's age and gender based only on his or her voice. Up to five different systems are compared and combined in different configurations: three systems model the speaker's characteristics in different feature spaces (MFCC, PLP, and TRAPS) with Gaussian mixture models; the features of these systems are the concatenated mean vectors. The fourth system uses a physical two-mass vocal model and estimates nine glottal features from voiced speech sections in a data-driven optimization procedure; for each utterance the minimum, maximum, and mean vectors form a 27-dimensional feature vector. The fifth system computes a 219-dimensional prosodic feature set for each utterance based on voiced and unvoiced speech segments. We compare two ways of fusing the systems: first, we concatenate the systems at the feature level; second, we combine them at the score level by multi-class logistic regression. Although there are only minor differences between the two approaches, late fusion is slightly superior. On the development set of the Interspeech Agender challenge we achieve an unweighted recall of 46.1% with early fusion and 47.8% with late fusion.

Index Terms: acoustic analysis, classification, Gaussian mixture models
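The sketch below illustrates the contrast between the two fusion strategies described in the abstract: early fusion concatenates the per-system utterance features into one vector before classification, while late fusion trains the systems separately and combines their per-class scores with multi-class logistic regression. It is not the authors' implementation; the random stand-in features, the use of scikit-learn's LogisticRegression for both the per-system classifiers and the fusion step, and the assumed seven age/gender classes are all illustrative choices.

```python
# Minimal sketch of early vs. late fusion on synthetic stand-in features.
# Assumptions: 7 age/gender classes, random feature data, and LogisticRegression
# standing in for the paper's per-system classifiers (which use GMMs).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_utts, n_classes = 200, 7                      # assumed class count for the challenge
y = rng.integers(0, n_classes, n_utts)          # stand-in utterance labels

# Hypothetical utterance-level feature vectors of three of the systems.
feats = {
    "gmm_means": rng.normal(size=(n_utts, 128)),  # concatenated GMM mean vectors
    "glottal":   rng.normal(size=(n_utts, 27)),   # min/max/mean of 9 glottal features
    "prosodic":  rng.normal(size=(n_utts, 219)),  # prosodic feature set
}

# --- Early fusion: concatenate all features, train a single classifier -------
X_early = np.hstack(list(feats.values()))
early_clf = LogisticRegression(max_iter=1000).fit(X_early, y)

# --- Late fusion: per-system classifiers, then logistic regression on scores --
score_blocks = []
for X in feats.values():
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    score_blocks.append(clf.predict_proba(X))   # per-class scores of this system
X_scores = np.hstack(score_blocks)
late_clf = LogisticRegression(max_iter=1000).fit(X_scores, y)

print("early-fusion predictions:", early_clf.predict(X_early[:5]))
print("late-fusion predictions: ", late_clf.predict(X_scores[:5]))
```

In practice the score blocks would come from the five trained systems themselves (e.g., GMM log-likelihoods per class), and the fusion classifier would be trained on held-out data rather than on the same utterances as the per-system models.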
