A persistent hurdle in medical diagnostics is the accurate and interpretable classification of audio signals. In this study, we present an image-based representation of infant cry recordings for detecting abnormal cries with a vision transformer (ViT), and we show significant improvements in both the performance and the interpretability of this computer-aided tool. Advanced feature extraction with Gammatone Frequency Cepstral Coefficients (GFCCs) yielded a classification accuracy of 96.33%; the other features performed similarly, with 93.17% accuracy for the spectrogram and 94.83% for the mel-spectrogram. Our ViT model is less complex yet more effective than the previously proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques, namely Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms, to ensure transparency and reliability in decision-making and to clarify why the model makes its predictions. Detection accuracy was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and offering better prospects for trustworthy AI-based healthcare solutions.
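As a minimal illustration of the image-based pipeline the abstract describes (audio file → time-frequency "image" → ViT classifier), and not the authors' implementation, the sketch below uses log-mel spectrograms via librosa and timm's `vit_base_patch16_224` as a stand-in ViT; the file name `cry.wav`, the 16 kHz sample rate, and the two-class head are illustrative assumptions. A GFCC front end would be analogous, substituting a gammatone filterbank for the mel filterbank.

```python
# Minimal sketch: infant-cry audio -> log-mel "image" -> ViT classifier.
# Assumptions (not from the paper): librosa for features, timm's ViT as a
# stand-in model, hypothetical file path, binary normal/abnormal labels.
import librosa
import numpy as np
import timm
import torch

def audio_to_mel_image(path, sr=16000, n_mels=128):
    """Load a cry recording and convert it to a normalized log-mel image."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log-scale the power
    # Min-max normalize to [0, 1] so it behaves like a grayscale image.
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)

img = audio_to_mel_image("cry.wav")                       # hypothetical file
x = torch.from_numpy(img).float()[None, None]             # (1, 1, H, W)
x = torch.nn.functional.interpolate(x, size=(224, 224))   # ViT input size
x = x.repeat(1, 3, 1, 1)                                  # replicate to 3 channels

model = timm.create_model("vit_base_patch16_224",
                          pretrained=True, num_classes=2)  # normal vs. abnormal
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(x), dim=-1)
print(probs)  # the new 2-class head is untrained; fine-tune before real use
```

In this framing, explanation methods such as LIME or attention visualization operate on the spectrogram image exactly as they would on a photograph, which is what makes the highlighted time-frequency regions directly interpretable.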