This study presented the development of a deep learning-based Automatic Speech Recognition (ASR) system for Shona, a low-resource language with distinctive tonal and grammatical complexity. The research addressed the challenges posed by limited and largely unlabelled training data and by the tonal nuances of Shona speech, with the objective of achieving significant gains in recognition accuracy over traditional statistical models. Motivated by the limitations of existing approaches, the research addressed three key questions. First, it explored the feasibility of using deep learning to build an accurate ASR system for Shona. Second, it investigated the specific challenges of designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate them. Third, it compared the accuracy of the deep learning-based model against existing statistical models. The developed ASR system used a hybrid architecture: a Convolutional Neural Network (CNN) for acoustic modelling and a Long Short-Term Memory (LSTM) network for language modelling. To overcome data scarcity, the research employed data augmentation and transfer learning, and attention mechanisms were incorporated to accommodate the tonal nature of Shona speech. The resulting system achieved a Word Error Rate (WER) of 29%, a Phoneme Error Rate (PER) of 12%, and an overall accuracy of 74%, a significant improvement over existing statistical models. These results highlight the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
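For readers unfamiliar with the reported metrics: WER and PER are both edit-distance measures, the Levenshtein distance between the recognised sequence and the reference transcript, normalised by the reference length, computed over words for WER and over phoneme sequences for PER. A minimal sketch of the standard computation follows; the Shona example tokens are purely illustrative and are not drawn from the study's corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all remaining reference tokens
    for j in range(n + 1):
        d[0][j] = j  # insert all remaining hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[m][n]


def error_rate(reference, hypothesis):
    """Edit distance normalised by reference length.

    Over word lists this gives WER; over phoneme lists, PER.
    """
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted word out of three.
ref = "ndiri kuenda kumba".split()
hyp = "ndiri kuenda kumusha".split()
print(f"WER: {error_rate(ref, hyp):.2f}")
```

A WER of 29% therefore means that, on average, roughly 29 word-level insertions, deletions, or substitutions were needed per 100 reference words to turn the system's output into the correct transcript.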