Improved mini-batch multiple augmentation for low-resource spoken word recognition

Alexander Rogath Kivaisi, Qingjie Zhao

doi:10.1016/j.eswa.2024.124157

Abstract

Data augmentation techniques are useful for dealing with limited data in machine learning tasks. Recently, spectrogram data augmentation techniques have been investigated for voice conversion and sound classification tasks, where they have improved results. However, applying multiple data augmentation techniques within a mini-batch has been observed to degrade performance. While applying multiple augmentation methods sequentially has shown performance gains on image data, transferring this approach to spectrogram data leads to a loss of acoustic information. Hence, an alternative approach is needed to use multiple augmentation methods effectively in the speech domain. This study addresses these challenges for spoken word recognition in low-resource settings, at the level of the mini-batch. First, we investigate the effect of individual data augmentation techniques. Second, we investigate the effect of combining multiple data augmentation techniques. Finally, we propose a new approach that uses an alternating mechanism to exploit multiple spectrogram augmentation techniques more effectively. Our experimental results show that the proposed approach (new pattern) significantly outperforms the sequential approach (traditional pattern) across datasets of different sizes, including low-resource settings. In addition, the proposed approach achieves an actual speedup of approximately 2x over the sequential approach. A combination of frequency-warping and time length control augmentation methods was found to be stable and robust across all datasets evaluated.
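The contrast between the sequential (traditional) pattern and an alternating pattern can be illustrated with a minimal sketch. The abstract does not specify the exact mechanism, so the code below is only one possible reading: each spectrogram in a mini-batch receives a single augmentation, chosen by cycling through the available methods, instead of having every method stacked on it in turn. The functions `freq_mask` and `time_mask` are hypothetical stand-ins (simple masking), not the frequency-warping or time length control methods evaluated in the paper, and the names `sequential_batch` and `alternating_batch` are likewise assumptions made for illustration.

```python
import numpy as np

def freq_mask(spec, rng, width=4):
    """Stand-in augmentation: zero a random band of frequency bins (rows)."""
    out = spec.copy()
    start = rng.integers(0, spec.shape[0] - width + 1)
    out[start:start + width, :] = 0.0
    return out

def time_mask(spec, rng, width=4):
    """Stand-in augmentation: zero a random span of time frames (columns)."""
    out = spec.copy()
    start = rng.integers(0, spec.shape[1] - width + 1)
    out[:, start:start + width] = 0.0
    return out

def sequential_batch(batch, augmenters, rng):
    """Traditional pattern: every sample passes through all augmenters in order."""
    out = []
    for spec in batch:
        for aug in augmenters:
            spec = aug(spec, rng)
        out.append(spec)
    return np.stack(out)

def alternating_batch(batch, augmenters, rng):
    """Assumed alternating pattern: sample i receives only augmenter i mod k."""
    k = len(augmenters)
    return np.stack([augmenters[i % k](spec, rng) for i, spec in enumerate(batch)])

rng = np.random.default_rng(0)
# Toy mini-batch: 4 spectrograms, 16 frequency bins x 20 time frames, all ones.
batch = np.ones((4, 16, 20))
augmenters = [freq_mask, time_mask]

seq = sequential_batch(batch, augmenters, rng)
alt = alternating_batch(batch, augmenters, rng)

# Count how many cells each pattern zeroed per sample.
seq_zeros = [int((s == 0).sum()) for s in seq]
alt_zeros = [int((s == 0).sum()) for s in alt]
```

In this toy setting the sequential pattern removes more of every spectrogram (both masks land on each sample), while the alternating pattern removes less per sample and applies one transform per sample rather than two. That is in line with the abstract's concern about loss of acoustic information and with a speedup on the order of the number of methods, although the abstract itself does not state the cause of the reported 2x figure.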

Full Text