Abstract

This paper investigates subband temporal envelope (STE) features and speed-perturbation-based data augmentation for end-to-end recognition of distant conversational speech in everyday home environments. STE features track energy peaks in perceptual frequency bands, which reflect the resonant properties of the vocal tract. Data augmentation adds speed-modified copies of the original training data to the training set. Experiments show that both STE features and speed-perturbation-based data augmentation improve the performance of end-to-end speech recognition on a challenging corpus that was used for the CHiME 2018 speech separation and recognition challenge. STE features provide up to 2.0% relative word error rate (WER) reduction compared to conventional log-Mel filter-bank (FBANK) features. Data augmentation is applied with both feature types and provides up to 5.2% relative WER reduction. We also propose a simple hypothesis selection method that combines the hypotheses produced by the end-to-end systems using FBANK and STE features. This method provides up to 4.7% additional relative WER reduction.
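
As a rough illustration of the speed perturbation idea described above, the following minimal sketch resamples a waveform by linear interpolation so that it plays faster or slower, then appends the perturbed copies to the original data. The function names `speed_perturb` and `augment` and the factors 0.9 and 1.1 are assumptions made for illustration; the abstract does not state which factors or which resampling tool were used, and a real pipeline would likely use a band-limited resampler rather than linear interpolation.

```python
import math


def speed_perturb(samples, factor):
    """Return a copy of `samples` played `factor` times faster.

    Hypothetical helper: uses plain linear interpolation, which is only
    a stand-in for the band-limited resampling a real toolkit would use.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    # A faster signal (factor > 1) has fewer samples, a slower one has more.
    n_out = int(round(n_in / factor))
    out = []
    for i in range(n_out):
        # Position in the original signal that output sample i is read from,
        # clamped to the last valid index.
        pos = min(i * factor, n_in - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def augment(utterances, factors=(0.9, 1.1)):
    """Return the original utterances plus one perturbed copy per factor.

    The default factors are an assumption, not taken from the paper.
    """
    augmented = list(utterances)
    for factor in factors:
        augmented.extend(speed_perturb(u, factor) for u in utterances)
    return augmented
```

With the assumed factors, `augment` triples the amount of training audio: each utterance appears once unchanged, once slowed down, and once sped up. Transcripts are unchanged by the perturbation, so each copy can reuse the original label.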
