Preventing Deforestation in the Indian Landscape Through Neural Network-Based Intelligence Using Sound Event Detection and Advanced Feature Extraction Techniques

TL;DR

This study addresses forest monitoring in India by developing a sound event detection model using deep learning and advanced feature extraction techniques to distinguish chainsaw, handsaw, axe-cutting, and environmental sounds. The Customized-CNN achieved 98% accuracy, with performance improving from 95% to 98% as feature extraction clusters increased from two to six, demonstrating effective detection of deforestation-related activities.

Abstract

Forests play a vital role in maintaining ecological balance, regulating the climate, and conserving biodiversity. However, India’s forest landscape has witnessed significant changes between 1980 and 2024 due to deforestation, afforestation, and evolving conservation strategies. To address the challenges associated with forest monitoring, we proposed a model based on Sound Event Detection using a dataset comprising four classes: chainsaw sounds, handsaw sounds, axe-cutting sounds (synthetic), and negative environmental sounds (e.g., birds, animals, wind). The dataset was constructed from publicly available resources, except for the axe-cutting sound class, which was prepared synthetically. The model employed six feature extraction techniques, namely Mel-Spectrogram, Mel-Frequency Cepstral Coefficients (MFCC), Chroma, Spectral Contrast, Tonnetz, and Spectral Bandwidth, to capture critical audio characteristics. These features enabled the efficient representation of harmonic content, temporal patterns, and timbre, which were essential for distinguishing between classes. The proposed approach was evaluated using various deep learning models, including Customized 1D Convolutional Neural Networks (CNN), Bi-directional Convolutional Recurrent Neural Networks (Bi-CRNN), Bi-directional Gated Recurrent Unit-based CRNNs (Bi-GRU-CRNN), AlexNet, and ResNet. The Customized-CNN, implemented using Keras, demonstrated superior performance with an accuracy of 98%. The model’s effectiveness was further validated as accuracy increased progressively from 95% to 98% as the number of feature extraction clusters increased from two to six.
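
To make one of the six listed features concrete, the following is a minimal sketch of how spectral centroid and spectral bandwidth could be computed for a single audio frame. It is not the authors' pipeline: the naive DFT, the function names, the 8 kHz sample rate, and the synthetic test tones are all assumptions chosen for illustration, and a real system would more likely rely on an audio library.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (only suitable for short frames)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectral_centroid_and_bandwidth(frame, sample_rate):
    """Return (centroid, bandwidth) in Hz.

    The centroid is the magnitude-weighted mean frequency; the bandwidth is
    the magnitude-weighted standard deviation of frequency around it.
    """
    mags = magnitude_spectrum(frame)
    n = len(frame)
    freqs = [k * sample_rate / n for k in range(len(mags))]
    total = sum(mags)
    if total == 0:
        return 0.0, 0.0
    weights = [m / total for m in mags]
    centroid = sum(f * w for f, w in zip(freqs, weights))
    variance = sum(w * (f - centroid) ** 2 for f, w in zip(freqs, weights))
    return centroid, math.sqrt(variance)


# Hypothetical test signals: one pure tone, and a mix of two equal tones.
SAMPLE_RATE = 8000  # assumed value, not from the paper
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(FRAME_LEN)]
two_tones = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    + math.sin(2 * math.pi * 3000 * t / SAMPLE_RATE)
    for t in range(FRAME_LEN)
]

tone_centroid, tone_bandwidth = spectral_centroid_and_bandwidth(tone, SAMPLE_RATE)
mix_centroid, mix_bandwidth = spectral_centroid_and_bandwidth(two_tones, SAMPLE_RATE)
print(tone_centroid, tone_bandwidth)
print(mix_centroid, mix_bandwidth)
```

A narrowband sound such as a steady tone yields a bandwidth near zero, while energy spread across distant frequencies widens it, which is one reason such a feature may help separate tonal machinery noise from broadband environmental sound.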

Similar Papers
  • Conference Article
  • Cite Count Icon 12
  • 10.23919/eusipco47968.2020.9287372
Sound Event Localization and Detection Using Convolutional Recurrent Neural Networks and Gated Linear Units
  • Jan 24, 2021
  • Tatsuya Komatsu + 2 more

This paper proposes a sound event localization and detection (SELD) method using a convolutional recurrent neural network (CRNN) with gated linear units (GLUs). The proposed method employs GLUs with the convolutional neural network (CNN) layers of the CRNN to extract adequate spectral features from amplitude and phase spectra. When the CNNs extract features of high-dimensional dependencies of frequency bins, the GLUs weight the extracted features based on the importance of the bins, like an attention mechanism. Features extracted from bins where sounds are absent, which are not informative and degrade the SELD performance, are weighted to 0 and ignored by the GLUs. Only the features extracted from informative bins are used for the CNN output for better SELD performance. The obtained CNN outputs are fed to consecutive bi-directional gated recurrent units (GRUs), which capture temporal information. Finally, the GRU outputs are shared by two task-specific layers, which are sound event detection (SED) layers and direction of arrival (DoA) estimation layers, to obtain SELD results. Evaluation results using the TAU Spatial Sound Events 2019 - Ambisonic dataset show the effectiveness of GLUs in the proposed method, which improves SELD performance by up to 0.10 in F1-score, 0.15 in error rate, and 16.4° in DoA estimation error compared to a CRNN baseline method.
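
The gating behaviour described in this abstract can be illustrated with a minimal sketch of a gated linear unit, where one set of activations is multiplied element-wise by the sigmoid of a second set. The function names and the toy numbers are assumptions for illustration only; in the paper both inputs would come from convolutional layers rather than hand-written lists.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(values, gate_logits):
    """Gated linear unit: values[i] * sigmoid(gate_logits[i]) for each i."""
    if len(values) != len(gate_logits):
        raise ValueError("values and gate_logits must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gate_logits)]


# Hypothetical per-bin features and gate logits (illustrative numbers).
features = [0.8, -1.2, 2.0]
gate_logits = [10.0, -10.0, 0.0]  # open gate, closed gate, half-open gate
gated = glu(features, gate_logits)
print(gated)
```

A strongly negative gate logit pushes the corresponding feature towards 0, which corresponds to the abstract's description of uninformative bins being ignored.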

  • Research Article
  • Cite Count Icon 3
  • 10.1371/journal.pone.0300444
DEW: A wavelet approach of rare sound event detection.
  • Mar 28, 2024
  • PLOS ONE
  • Sania Gul + 2 more

This paper presents a novel sound event detection (SED) system for rare events occurring in an open environment. Wavelet multiresolution analysis (MRA) is used to decompose the input audio clip of 30 seconds into five levels. Wavelet denoising is then applied on the third and fifth levels of MRA to filter out the background. Significant transitions, which may represent the onset of a rare event, are then estimated in these two levels by combining the peak-finding algorithm with the K-medoids clustering algorithm. Small portions of one-second duration, called 'chunks', are cropped from the input audio signal at the estimated locations of the significant transitions. Features from these chunks are extracted by the wavelet scattering network (WSN) and are given as input to a support vector machine (SVM) classifier, which classifies them. The proposed SED framework produces an error rate comparable to the SED systems based on convolutional neural network (CNN) architecture. Also, the proposed algorithm is computationally efficient and lightweight as compared to deep learning models, as it has no learnable parameter. It requires only a single epoch of training, which is 5, 10, 200, and 600 times fewer than the models based on CNNs and deep neural networks (DNNs), CNN with long short-term memory (LSTM) network, convolutional recurrent neural network (CRNN), and CNN respectively. The proposed model requires neither concatenation with previous frames for anomaly detection nor the additional training data creation needed by the other comparative deep learning models. It needs to check almost 360 times fewer chunks for the presence of rare events than the other baseline systems used for comparison in this paper. All these characteristics make the proposed system suitable for real-time applications on resource-limited devices.

  • Conference Article
  • Cite Count Icon 10
  • 10.1109/iscslp49672.2021.9362116
A Model Ensemble Approach for Sound Event Localization and Detection
  • Jan 24, 2021
  • Qing Wang + 9 more

In this paper, we propose a model ensemble approach for sound event localization and detection (SELD). We adopt several deep neural network (DNN) architectures to perform sound event detection (SED) and direction-of-arrival (DOA) estimation simultaneously. Generally, the DNN architecture consists of three modules stacked together, i.e., a High-level Feature Representation module, a Temporal Context Representation module, and a Fully-connected module at the end. The High-level Feature Representation module usually contains a series of convolutional neural network (CNN) layers to extract useful local features. The Temporal Context Representation module aims to model longer temporal context dependency in the extracted features. There are two parallel branches in the Fully-connected module, one for SED estimation and the other for DOA estimation. With different combinations of implementations in the High-level Feature Representation module and Temporal Context Representation module, several network architectures are used for the SELD task. Finally, a more robust prediction of SED and DOA is obtained by model ensemble and post-processing. Tested on the development and evaluation datasets, the proposed approach achieves promising results and ranks first in the DCASE 2020 Task 3 challenge. Index Terms: sound event localization and detection, deep neural network, model ensemble
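
The abstract does not say how the ensemble combines its members, so the following is only a sketch of one common option: averaging per-class event probabilities across models and thresholding the result. The averaging rule, the 0.5 threshold, the function names, and the numbers are all assumptions, not details taken from the paper.

```python
def ensemble_average(model_probs):
    """Average per-class probabilities over several models.

    model_probs is a list with one probability list per model;
    all inner lists must have the same number of classes.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def active_events(avg_probs, threshold=0.5):
    """Indices of classes whose averaged probability reaches the threshold."""
    return [c for c, p in enumerate(avg_probs) if p >= threshold]


# Hypothetical outputs of three models for three event classes in one frame.
probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.1, 0.5],
]
avg = ensemble_average(probs)
active = active_events(avg)
print(avg, active)
```

In this toy case the third class is flagged by one model alone but falls below the threshold after averaging, which shows how an ensemble can suppress a single model's spurious detection.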

  • Research Article
  • 10.47392/irjaeh.2025.0550
Machine Learning Methods for Speech Emotion Recognition
  • Sep 24, 2025
  • International Research Journal on Advanced Engineering Hub (IRJAEH)
  • Mr Arun Kumar E + 1 more

Natural human-computer interaction requires the ability to identify human emotions from speech. Due to its many uses in virtual assistants, mental health evaluation, education, entertainment, and customer support systems, speech emotion recognition (SER) has attracted a lot of attention lately. This study uses sophisticated feature extraction and classification techniques to investigate a machine learning-based method for speech emotion classification. In this work, we use acoustic features like spectral contrast, chroma, and Mel-Frequency Cepstral Coefficients (MFCC) to extract emotional cues from speech signals. Convolutional Neural Networks (CNN), Random Forest (RF), and Support Vector Machines (SVM) are among the classifiers that are trained and assessed using these features. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) serves as the training and testing benchmark dataset. According to experimental results, deep learning models, particularly CNN and CNN-LSTM hybrids, perform better than conventional machine learning techniques. Combining temporal and spectral features effectively captures emotional nuances in speech, as evidenced by the CNN model's 84.2% accuracy and the CNN-LSTM model's peak accuracy of 86.7%. The suggested model's robustness and capacity for generalization are validated by a thorough analysis employing confusion matrices and precision-recall metrics. Understanding user emotions can greatly improve the quality of interactions in real-world applications, and this research offers a solid basis for integrating SER systems. Future research will focus on handling noisy environments, enhancing cross-linguistic performance, and enabling real-time deployment of embedded systems. This study also emphasizes how crucial it is to choose the ideal feature combination to accurately depict emotional content.
The addition of Chroma and Spectral Contrast improves the model's capacity to identify subtle emotional inflections, especially in similar-sounding classes like "calm" vs. "happy" or "angry" vs. "fearful," even though MFCCs provide a condensed and popular representation of the speech spectrum. To increase recognition accuracy across a variety of speaker profiles, feature fusion is essential. This study also contrasts shallow and deep learning classifiers to highlight their advantages and disadvantages. Traditional classifiers, such as SVM and Random Forest, perform poorly when working with raw or complex features, despite being computationally light and efficient for small-scale systems. On the other hand, automatic feature learning and temporal modeling help the CNN and CNN-LSTM architectures capture complex prosody, rhythm, and tone patterns linked to emotional expressions.

  • Conference Article
  • Cite Count Icon 8
  • 10.33682/fx8n-cm43
Sound Event Classification and Detection with Weakly Labeled Data
  • Jan 1, 2019
  • Sharath Adavanne + 2 more

The Sound Event Classification (SEC) task involves recognizing the set of active sound events in an audio recording. The Sound Event Detection (SED) task involves, in addition to SEC, detecting the temporal onset and offset of every sound event in an audio recording. Generally, SEC and SED are treated as supervised classification tasks that require labeled datasets. SEC only requires weak labels, i.e., annotation of active sound events, without the temporal information, whereas SED requires strong labels, i.e., annotation of the onset and offset times of every sound event, which makes annotation for SED more tedious than for SEC. In this paper, we propose two methods for joint SEC and SED using weakly labeled data: a Fully Convolutional Network (FCN) and a novel method that combines a Convolutional Neural Network with an attention layer (CNNatt). Unlike most prior work, the proposed methods do not assume that the weak labels are active during the entire recording and can scale to large datasets. We report state-of-the-art SEC results obtained with the largest weakly labeled dataset, AudioSet.
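
One way to see how frame-level predictions can be trained against a clip-level (weak) label is attention pooling, sketched below. This is a generic illustration and not the paper's CNNatt layer: the softmax-over-time weighting, the function names, and the numbers are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of real numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_probs, attention_logits):
    """Clip-level probability as an attention-weighted mean of frame probabilities."""
    weights = softmax(attention_logits)
    return sum(w * p for w, p in zip(weights, frame_probs))


# Hypothetical clip: the event is active only in frames 2 and 3.
frame_probs = [0.05, 0.10, 0.95, 0.90, 0.05]
attention_logits = [-2.0, -2.0, 3.0, 3.0, -2.0]

clip_prob = attention_pool(frame_probs, attention_logits)
mean_prob = sum(frame_probs) / len(frame_probs)
print(clip_prob, mean_prob)
```

Plain averaging dilutes a short event across the whole clip, whereas the attention weights let the clip-level score follow the frames where the event is present, so the weak label need not be assumed active for the entire recording.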

  • Conference Article
  • Cite Count Icon 3
  • 10.1145/3675888.3676028
Environment Sound Classification using stacked features and convolutional neural network
  • Aug 8, 2024
  • Shilpa Gupta + 2 more

Environmental Sound Classification (ESC) finds vital applications in wildlife conservation, audio-video systems, musical instrument classification, automatic speech recognition systems, sound event detection, etc. Many conventional model training methods that depend on an enormous amount of annotated data have been proposed in the literature for this task. The proposed technique focuses on the use of a CNN for classifying short audio clips of environmental sounds. Existing models use auditory features such as either the Log-Mel spectrogram (LM) or Mel Frequency Cepstral Coefficients (MFCC) for classification. To improve on this, we have stacked the different features, visualized using the Librosa library, so that all the feature information is combined into one image. The accuracy of the network is evaluated on the ESC-50 dataset of environmental and urban recordings. The stacked features seemed to perform better for the dataset chosen. Three stacked features are provided as input in the form of channels to a transfer-learned model, which outperforms CNN models trained from scratch. The highest precision and recall, 95.91% and 95.81% respectively, are obtained for the Log-Mel Scale Spectrogram, Spectral Contrast, and chroma features.

  • Research Article
  • Cite Count Icon 546
  • 10.1109/jstsp.2018.2885636
Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks
  • Dec 17, 2018
  • IEEE Journal of Selected Topics in Signal Processing
  • Sharath Adavanne + 3 more

| openaire: EC/H2020/637422/EU//EVERYSOUND

  • Research Article
  • Cite Count Icon 2
  • 10.1088/1742-6596/2010/1/012108
Weakly and semi-supervised learning for sound event detection using image pretrained convolutional recurrent neural network, weighted pooling and mean teacher method
  • Sep 1, 2021
  • Journal of Physics: Conference Series
  • Xichang Cai + 3 more

In this paper, we propose a sound event detection (SED) method which uses deep neural network trained on weak labeled and unlabeled data. The proposed method utilizes a convolutional recurrent neural network (CRNN) to extract high level features of audio clips. Inspired by the impressive performance of transfer learning in the field of image recognition, the convolutional neural network (CNN) in the proposed CRNN is an image-pretrained model. Although there is a significant difference between audio and image, the image-pretrained CNN still has competitive performance in SED and can effectively reduce the amount of training data needed. To learn from weak labeled data, the proposed method utilizes a weighted pooling strategy which enables the network to focus on the frames containing events in an audio clip. For unlabeled data, the proposed method utilizes a mean teacher semi-supervised learning method and data augmentation technique. To demonstrate the performance of the proposed method, we conduct the experimental evaluation using the DCASE2021 Task4 dataset. The experimental results demonstrate that the proposed method outperforms the DCASE2021 Task4 baseline method.
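
The mean teacher method mentioned above keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The sketch below shows only that update rule on a flat list of numbers; the smoothing factor of 0.9, the function name, and the toy weights are assumptions, and the consistency loss between student and teacher predictions is omitted.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


# Hypothetical two-parameter model: the student is held fixed here so the
# drift of the teacher towards the student is easy to see.
teacher = [0.0, 0.0]
student = [1.0, -2.0]

for _ in range(100):
    teacher = ema_update(teacher, student, alpha=0.9)

print(teacher)
```

Because the teacher changes slowly, its predictions on unlabeled clips tend to be more stable than the student's, which is what makes them usable as training targets in this family of methods.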

  • Research Article
  • Cite Count Icon 36
  • 10.1016/j.eswa.2022.118064
Classification of asphyxia infant cry using hybrid speech features and deep learning models
  • Jul 7, 2022
  • Expert Systems with Applications
  • Hua-Nong Ting + 2 more

  • Research Article
  • Cite Count Icon 6
  • 10.4018/ijirr.2021100103
Recognition of Musical Instrument Using Deep Learning Techniques
  • Oct 1, 2021
  • International Journal of Information Retrieval Research
  • Sangeetha Rajesh + 1 more

The proposed work investigates the impact of Mel Frequency Cepstral Coefficients (MFCC), Chroma DCT Reduced Pitch (CRP), and Chroma Energy Normalized Statistics (CENS) for instrument recognition from monophonic instrumental music clips using deep learning techniques, Bidirectional Recurrent Neural Networks with Long Short-Term Memory (BRNN-LSTM), stacked autoencoders (SAE), and Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM). Initially, MFCC, CENS, and CRP features are extracted from instrumental music clips collected as a dataset from various online libraries. In this work, the deep neural network models have been fabricated by training with extracted features. Recognition rates of 94.9%, 96.8%, and 88.6% are achieved using combined MFCC and CENS features, and 90.9%, 92.2%, and 87.5% are achieved using combined MFCC and CRP features with deep learning models BRNN-LSTM, CNN-LSTM, and SAE, respectively. The experimental results evidence that MFCC features combined with CENS and CRP features at score level revamp the efficacy of the proposed system.

  • Research Article
  • Cite Count Icon 7
  • 10.1109/access.2020.2974479
Research on Semi-Supervised Sound Event Detection Based on Mean Teacher Models Using ML-LoBCoD-NET
  • Jan 1, 2020
  • IEEE Access
  • Jinjia Wang + 3 more

One of the most commonly used methods for sound event detection is the traditional convolutional neural network (CNN) or convolutional recurrent neural network (CRNN) and their variants. However, the pooling operation of the CNN has the disadvantage of losing the location information of the target object. We do not use the pooling operation, retaining the ReLU and convolution operations, and we use the dictionary strong constraints and penalty function prior constraints of the multi-layer convolutional sparse coding (ML-CSC). We proposed iterative deep neural networks, the unfolded multi-layer local block coordinate descent networks (ML-LoBCoD-NET), driven by the multi-layer local block coordinate descent algorithm (ML-LoBCoD), which is extended from the local block coordinate descent (LoBCoD) algorithm. The ML-LoBCoD-NET can extract features different from the CNN. More importantly, for the weakly-supervised sound event detection task, we proposed the MRNN-Att network, which combines the ML-LoBCoD-NET, a recurrent neural network (RNN), and an attention network. The MCRNN-Att network combines the MRNN-Att and CRNN networks to fuse the different features. Furthermore, for the semi-supervised sound event detection task, the MRNN-Att mean teacher model (MRNN-Att-MT) and the MCRNN-Att mean teacher model (MCRNN-Att-MT) are proposed, in which the MRNN-Att and the MCRNN-Att networks are selected as the student models. These models were tested on the dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 4. The F1 score of the MRNN-Att-MT on the development set was 22.83%, which was 8.77% higher than the baseline system. The score of the MRNN-Att-MT on the evaluation set was 15.68%, which was 4.88% higher than the baseline system. The MCRNN-Att-MT model had an F1 score of 20.35% on the development set, which was 6.29% higher than the baseline system, and an F1 score of 14.56% on the evaluation set, which was 3.76% higher than the baseline system.

  • Research Article
  • Cite Count Icon 2
  • 10.12928/telkomnika.v18i5.14246
Sound event detection using deep neural networks
  • Oct 1, 2020
  • TELKOMNIKA (Telecommunication Computing Electronics and Control)
  • Suk-Hwan Jung + 1 more

We applied various architectures of deep neural networks for sound event detection and compared their performance using two different datasets. Feed forward neural network (FNN), convolutional neural network (CNN), recurrent neural network (RNN) and convolutional recurrent neural network (CRNN) were implemented using hyper-parameters optimized for each architecture and dataset. The results show that the performance of deep neural networks varied significantly depending on the learning rate, which can be optimized by conducting a series of experiments on the validation data over predetermined ranges. Among the implemented architectures, the CRNN performed best under all testing conditions, followed by CNN. Although RNN was effective in tracking the time-correlation information in audio signals, it exhibited inferior performance compared to the CNN and the CRNN. Accordingly, it is necessary to develop more optimization strategies for implementing RNN in sound event detection.

  • Research Article
  • Cite Count Icon 2
  • 10.7236/ijasc.2020.9.2.20
CNN based Sound Event Detection Method using NMF Preprocessing in Background Noise Environment
  • Jul 21, 2020
  • The International Journal of Advanced Smart Convergence
  • Bum-Suk Jang + 1 more

Sound event detection in real-world environments suffers from the interference of non-stationary and time-varying noise. This paper presents an adaptive noise reduction method for sound event detection based on non-negative matrix factorization (NMF). We propose a deep learning model that integrates a Convolutional Neural Network (CNN) with NMF. To improve the separation quality of the NMF, it includes a noise update technique that learns and adapts to the characteristics of the current noise in real time. The noise update technique analyzes the sparsity and activity of the noise bias at the present time and decides the update training based on the noise candidate group obtained every frame in the previous noise reduction stage. Noise bias ranks selected as candidates for update training are updated in real time with discriminative NMF training. This NMF was applied to a CNN and a Hidden Markov Model (HMM) to improve sound event detection performance. Since the CNN shows the more obvious performance improvement, the method can be widely used in sound-source-based CNN algorithms.

  • Research Article
  • Cite Count Icon 14
  • 10.1587/transinf.2020edp7036
Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning
  • Oct 16, 2020
  • IEICE Transactions on Information and Systems
  • Noriyuki Tonami + 3 more

Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene "office," the sound events "mouse clicking" and "keyboard typing" are likely to occur. Therefore, it is expected that information on sound events and acoustic scenes will be of mutual aid for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information on sound events and acoustic scenes in common are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.

  • Preprint Article
  • 10.21203/rs.3.rs-4487345/v1
Convolutional Automatic Identification of B-lines and Interstitial Syndrome in Lung Ultrasound Images Using Pre-Trained Neural Networks with Feature Fusion
  • Jun 18, 2024
  • Research Square
  • Khalid Moafa + 11 more

Background Interstitial/Alveolar Syndrome (IS) is a condition detectable on lung ultrasound (LUS) that indicates underlying pulmonary or cardiac diseases associated with significant morbidity and increased mortality rates. The diagnosis of IS using LUS can be challenging and time-consuming, and it requires clinical expertise. Methods In this study, multiple Convolutional Neural Network (CNN) deep learning (DL) models were trained, acting as binary classifiers, to accurately screen for IS from LUS frames by differentiating between IS-present and healthy cases. The CNN DL models were initially pre-trained using a generic image dataset to learn general visual features (ImageNet), and then fine-tuned on our specific dataset of 108 LUS clips from 54 patients (27 healthy and 27 with IS), with two clips per patient, to perform a binary classification task. Each frame within a clip was assessed to determine the presence of IS features or to confirm a healthy lung status. The dataset was split into training (70%), validation (15%), and testing (15%) sets. Following the process of fine-tuning, we successfully extracted features from pre-trained DL models. These extracted features were utilised to train multiple machine learning (ML) classifiers, hence the trained ML classifiers yielded significantly improved accuracy in IS classification. Advanced visual interpretation techniques, such as heatmaps based on Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-Agnostic explanations (LIME), were implemented to further analyse the outcomes. Results The best-trained ML model achieved a test accuracy of 98.2%, with specificity, recall, precision, and F1-score values all above 97.9%. Our study demonstrates, for the first time, the feasibility of using a pre-trained CNN with the feature extraction and fusion technique as a diagnostic tool for IS screening on LUS frames, providing a time-efficient and practical approach to clinical decision-making. 
Conclusion This study confirms the practicality of using pre-trained CNN models, with the feature extraction and fusion technique, for screening IS through LUS frames. This represents a noteworthy advancement in improving the efficiency of diagnosis. In the next steps, validation on larger datasets will assess the applicability and robustness of these CNN models in more complex clinical settings.