Abstract

Animal sound perception is often far more developed than that of humans and is crucial for survival in natural environments; many species also possess specialized sensory capabilities spanning vision, hearing, touch, and environmental awareness. Understanding animal sounds not only aids the animals' own communication and survival but also benefits humans in fields such as security, natural disaster prediction, ecological research, bioacoustics, precision agriculture, and search-and-rescue operations. Motivated by this, this study investigated the classification of cat sounds using deep learning models based on Vision Transformer (ViT) and Convolutional Neural Network (CNN) architectures. Cat vocalizations, represented as mel-spectrograms, were classified by models trained on a diverse dataset of cat sounds. Experimental results demonstrated the superiority of the proposed model based on Microsoft's BEiT (BERT Pre-Training of Image Transformers) over the state of the art, achieving an accuracy of 96.95%. Moreover, the proposed ViT-based models outperformed the CNN-based models, highlighting the efficacy of transformer architectures in capturing complex patterns in audio data. These findings underscore the potential of ViT architectures for decoding animal communication systems and advancing wildlife conservation efforts.
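The abstract's core preprocessing step is converting each vocalization into a mel-spectrogram "image" that a ViT or CNN can classify. The paper does not specify its parameters, so the following is only a minimal from-scratch sketch of that transformation (windowed magnitude STFT, triangular mel filterbank, log compression) using NumPy and illustrative values (16 kHz sample rate, 512-point FFT, 64 mel bands); a real pipeline would typically use a library such as librosa or torchaudio.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum.
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    # Project onto mel bands and log-compress, yielding the 2D "image".
    return np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

# Example: a 1-second synthetic rising chirp standing in for a cat sound.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
sig = np.sin(2 * np.pi * (400 + 300 * t) * t)
S = mel_spectrogram(sig)
print(S.shape)  # (time frames, mel bands)
```

The resulting (frames × mel bands) matrix is what gets resized and fed to the image classifier; for a ViT/BEiT model it would additionally be split into fixed-size patches before the transformer encoder.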
