Abstract
Speech is a natural source of information for human identification in biometrics, forensics, and access-control systems. Mismatch between training and test speech data is one of the biggest challenges preventing speaker identification systems from being deployed in real-world scenarios. This research explores how speaker identification is affected by degraded speech. Our preliminary investigation of mismatched training and test conditions uses the IIT-G database and covers mismatch in sensors and speaking styles. Convolutional neural networks (CNNs) have surpassed traditional techniques in speaker identification (SI) systems in recent years. This paper proposes a novel VGG-like architecture for an end-to-end speaker identification system. The proposed architecture outperforms statistical methods, improving identification accuracy by 8%. The results show that the proposed approach is more accurate than state-of-the-art speaker identification techniques, although mismatched conditions still cause notable performance deterioration relative to the matched scenario.
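To make the "VGG-like" design concrete, the sketch below traces feature-map shapes through a stack of 3x3 convolution + 2x2 max-pool blocks applied to a spectrogram input. The block channel counts (64, 128, 256, 512) and the 64x256 log-mel input size are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical shape trace for a VGG-like CNN used in speaker identification.
# Assumed config (NOT the paper's exact architecture): each block is a
# 3x3 conv with padding 1 (spatial size preserved) followed by 2x2 max-pool.

def conv_out(h, w, kernel=3, pad=1, stride=1):
    """Standard conv output-size formula: floor((n + 2p - k) / s) + 1."""
    return ((h + 2 * pad - kernel) // stride + 1,
            (w + 2 * pad - kernel) // stride + 1)

def pool_out(h, w, k=2):
    """2x2 max-pool halves each spatial dimension (floor division)."""
    return (h // k, w // k)

def vgg_like_shapes(h, w, blocks=(64, 128, 256, 512)):
    """Return the (channels, height, width) after each conv+pool block,
    for a (freq x time) spectrogram input of size h x w."""
    shapes = []
    for ch in blocks:
        h, w = conv_out(h, w)   # 3x3 conv, 'same' padding: size unchanged
        h, w = pool_out(h, w)   # pooling halves frequency and time axes
        shapes.append((ch, h, w))
    return shapes

# Example: a 64-bin x 256-frame log-mel spectrogram
print(vgg_like_shapes(64, 256))
# -> [(64, 32, 128), (128, 16, 64), (256, 8, 32), (512, 4, 16)]
```

The final 512 x 4 x 16 feature map would typically be flattened and passed to fully connected layers with a softmax over speaker identities for end-to-end classification.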