Abstract. This research explores the application of neural networks, specifically CNN-LSTM models, to classifying sounds from dogs, frogs, and cats drawn from the ESC-50 dataset. The audio was preprocessed into Mel-frequency cepstral coefficients (MFCCs) and augmented through time stretching, pitch shifting, and noise addition to improve model generalization across varied acoustic environments. We compared two deep learning models: a conventional CNN-LSTM and an improved variant with multi-scale feature extraction, which captures both short-term and long-term sound patterns. Our findings show that the multi-scale CNN-LSTM outperforms the conventional CNN-LSTM, achieving a test accuracy of 86.11% versus 80.56%. These results highlight the effectiveness of multi-scale feature extraction for handling complex audio signals. This research offers valuable insights for bioacoustics and has broader applications in areas such as environmental sound monitoring, wildlife preservation, and animal behavior analysis.
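The preprocessing described above, computing MFCCs from a clip and adding noise as one of the augmentations, can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's actual pipeline (which likely relies on an audio library such as librosa); the frame size, hop length, filter count, and SNR below are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_mfcc=13, n_fft=512, hop=256, n_filters=26):
    # Frame the signal, apply a Hamming window, take the power spectrum.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies -> log -> DCT-II gives the cepstral coefficients.
    energies = np.maximum(power @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    log_e = np.log(energies)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filters))
    return log_e @ dct.T  # shape: (n_frames, n_mfcc)

def add_noise(signal, snr_db=20.0, rng=None):
    # Noise-addition augmentation at a target signal-to-noise ratio (dB).
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    scale = np.sqrt(np.mean(signal ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    return signal + scale * noise

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clip = np.sin(2 * np.pi * 440 * t)  # synthetic 1-second tone standing in for an ESC-50 clip
feats = mfcc(add_noise(clip), sr)
print(feats.shape)  # → (61, 13)
```

The resulting (frames × coefficients) matrix is the kind of 2D feature map the CNN-LSTM models consume; time stretching and pitch shifting would be applied to the waveform before feature extraction in the same way.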