LSKFDY-CNN: Large selective kernel frequency dynamic convolutional neural network for sound event detection
- Conference Article
3
- 10.1109/icassp49357.2023.10096621
- Jun 4, 2023
CNN+RNN models have become the mainstream approach for semi-supervised sound event detection, and the CNN part is mainly a stack of several 2D convolutional layers that capture representations of the time-frequency features. However, conventional 2D convolution has limited ability to capture detailed information about acoustic events. In this paper, to enhance the representation ability of the CNN, we propose NAS-DYMC, a NAS-based dynamic multi-scale convolutional neural network that extracts a more effective acoustic representation. Specifically, multi-scale convolution can capture the characteristics of sound events with different time-frequency distributions, and dynamic convolution enhances the representation capability of conventional convolution by applying attention weights to basis kernels. Furthermore, a neural architecture search (NAS) method is adopted to find the optimal network architecture from a search space consisting of various dynamic multi-scale convolutions for the DCASE 2021 Task 4 dataset. Experimental results demonstrate the superiority of our proposed method.
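The idea of applying attention weights to basis kernels can be illustrated with a minimal 1D sketch in plain Python. It is not the NAS-DYMC implementation: the function names, the pooling-plus-projection rule for the attention logits, and all numeric values are assumptions chosen for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_logits(signal, proj):
    """Hypothetical attention branch: global average pooling of the input,
    then one scalar projection weight per basis kernel."""
    pooled = sum(signal) / len(signal)
    return [p * pooled for p in proj]

def mix_kernels(basis_kernels, weights):
    """Aggregate the basis kernels into one kernel as a weighted sum."""
    length = len(basis_kernels[0])
    return [
        sum(w * kernel[i] for w, kernel in zip(weights, basis_kernels))
        for i in range(length)
    ]

def conv1d_valid(signal, kernel):
    """'Valid' 1D cross-correlation, the operation used in CNN layers."""
    steps = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(steps)
    ]

# Illustrative values only: an edge-like kernel and an averaging kernel.
basis = [[1.0, 0.0, -1.0], [1.0 / 3, 1.0 / 3, 1.0 / 3]]
proj = [0.5, -0.5]            # hypothetical projection weights
signal = [1.0, 2.0, 4.0, 8.0]

weights = softmax(attention_logits(signal, proj))  # depends on the input
kernel = mix_kernels(basis, weights)               # input-specific kernel
output = conv1d_valid(signal, kernel)
```

Because the attention weights are computed from the input, a different signal yields a different effective kernel, whereas a conventional convolution would apply the same fixed kernel to every input.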
- Conference Article
46
- 10.1109/icassp40776.2020.9053045
- May 1, 2020
Polyphonic sound event detection and direction-of-arrival estimation require different input features from audio signals. While sound event detection mainly relies on time-frequency patterns, direction-of-arrival estimation relies on magnitude or phase differences between microphones. Previous approaches use the same input features for sound event detection and direction-of-arrival estimation, and train the two tasks jointly or in a two-stage transfer-learning manner. We propose a two-step approach that decouples the learning of the sound event detection and direction-of-arrival estimation systems. In the first step, we detect the sound events and estimate the directions-of-arrival separately to optimize the performance of each system. In the second step, we train a deep neural network to match the two output sequences of the event detector and the direction-of-arrival estimator. This modular and hierarchical approach allows flexibility in the system design and increases the performance of the whole sound event localization and detection system. The experimental results using the DCASE 2019 sound event localization and detection dataset show improved performance compared to the previous state-of-the-art solutions.
- Research Article
26
- 10.1007/s00521-004-0429-9
- Sep 18, 2004
- Neural Computing and Applications
Intelligent systems cover a wide range of technologies related to hard sciences, such as modeling and control theory, and soft sciences, such as artificial intelligence (AI). Intelligent systems, including neural networks (NNs), fuzzy logic (FL), and wavelet techniques, utilize the concepts of biological systems and human cognitive capabilities. These three systems have been recognized as robust and attractive alternatives to some of the classical modeling and control methods. The application of classical NNs, FL, and wavelet technology to dynamic system modeling and control has been constrained by the non-dynamic nature of their popular architectures. The major drawbacks of these architectures are the curse of dimensionality, such as the requirement of too many parameters in NNs, the use of large rule bases in FL, and the large number of wavelets, as well as long training times. These problems can be overcome with dynamic network structures, referred to as dynamic neural networks (DNNs), dynamic fuzzy networks (DFNs), and dynamic wavelet networks (DWNs), which have unconstrained connectivity and dynamic neural, fuzzy, and wavelet processing units, called “neurons”, “feurons”, and “wavelons”, respectively. The structure of dynamic networks is based on Hopfield networks. Here, we present a comparative study of DNNs, DFNs, and DWNs for non-linear dynamical system modeling. All three dynamic networks have a lag dynamic, an activation function, and interconnection weights. The network weights are adjusted using fast training (optimization) algorithms (quasi-Newton methods). It is shown that all dynamic networks can be effectively used in non-linear system modeling, and that DWNs result in the best capacity. But all networks have non-linearity properties in non-linear systems. In this study, all dynamic networks are considered as a non-linear optimization with dynamic equality constraints for non-linear system modeling. They encapsulate and generalize the target trajectories. The adjoint theory, whose computational complexity is significantly less than that of the direct method, has been used in the training of the networks. The updating of weights (identification of network parameters) is based on the Broyden–Fletcher–Goldfarb–Shanno method. First, phase portrait examples are given. From these, it is shown that the networks have oscillatory and chaotic properties. A dynamical system with discrete events is modeled using the above network structure. There is a localization property at discrete event instants for time and frequency in this example.
- Conference Article
47
- 10.21437/eurospeech.1997-708
- Sep 22, 1997
This paper presents new methods for training large neural networks for phoneme probability estimation. A combination of the time-delay architecture and the recurrent network architecture is used to capture the important dynamic information of the speech signal. Motivated by the fact that the number of connections in fully connected recurrent networks grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal or smaller number of connections. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database. The achieved phoneme error rate, 28.3%, for the standard 39-phoneme set on the core test set of the TIMIT database is not far from the lowest reported. All training and simulation software used is made freely available by the author, making reproduction of the results feasible.
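The super-linear growth mentioned in this abstract can be made concrete with a small counting sketch. A fully connected recurrent layer with H hidden units has H × H recurrent connections. The fixed fan-in used for the sparse case below is an assumption for illustration and is not necessarily the sparse connection scheme explored in the paper.

```python
def full_recurrent_connections(hidden_units):
    """Every hidden unit feeds every hidden unit: H * H connections."""
    return hidden_units * hidden_units

def sparse_recurrent_connections(hidden_units, fan_in):
    """Hypothetical sparse scheme: each unit receives at most `fan_in`
    recurrent inputs, so the count grows linearly in H."""
    return hidden_units * min(fan_in, hidden_units)

# Doubling the number of hidden units quadruples the full count
# but only doubles the sparse count.
for h in (100, 200, 400):
    full = full_recurrent_connections(h)
    sparse = sparse_recurrent_connections(h, fan_in=20)
    print(f"H={h}: full={full}, sparse={sparse}")
```

Under this assumption, a sparse network can afford many more hidden units than a fully connected one for the same connection budget.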
- Research Article
546
- 10.1109/jstsp.2018.2885636
- Dec 17, 2018
- IEEE Journal of Selected Topics in Signal Processing
| openaire: EC/H2020/637422/EU//EVERYSOUND
- Research Article
- 10.17586/2226-1494-2024-24-5-758-769
- Oct 1, 2024
- Scientific and Technical Journal of Information Technologies, Mechanics and Optics
The task of automatic metainformation recognition from audio sources is to detect and extract data of various natures (speech, noises, acoustic scenes, acoustic events, anomalies) from a given audio input signal. This area is well developed and known to the scientific community, and it has various high-quality approaches. However, the vast majority of such methods are based on large neural networks with a huge number of weights to be trained. Consequently, it is impractical to use them in environments with severely limited computing resources. The smart device industry is currently growing rapidly: smartphones, smart watches, voice assistants, TVs, smart homes. Such products have limitations in both processor and memory. At present, the state-of-the-art way to cope with these conditions is to use so-called low-complexity models. Moreover, in recent years, the interest of the scientific community in the above-mentioned problem has been growing (DCASE Workshop). Two of the most crucial subtasks in the global metainformation recognition problem are Automatic Scene Classification and Sound Event Detection. The most important scientific questions are the development of both the optimal low-complexity neural network architecture and the learning algorithms needed to obtain a low-resource, high-quality system for classifying acoustic scenes and detecting sound events. In this paper, the datasets from the DCASE Challenge tasks “Low-Complexity Acoustic Scene Classification” and “Sound Event Detection with Weak Labels and Synthetic Soundscapes” were used. A multitask neural network architecture was proposed, consisting of a common encoder and two independent decoders, one for each of the two tasks. The classical multitask learning algorithms SoftMTL and HardMTL were considered, and their modifications were developed: CrossMTL, which is based on the idea of reusing data from one task when training the decoder to solve the second task, and FreezeMTL, in which the trained weights of the common encoder are frozen after training on the first task and used to optimize the second decoder. The experiments showed that the CrossMTL modification can significantly increase the accuracy of acoustic scene classification and event detection compared with the classical approaches SoftMTL and HardMTL. The FreezeMTL algorithm made it possible to obtain a model that provides 42.44 % accuracy in scene classification and 45.86 % accuracy in event detection, which is comparable to the results of the baseline solutions of 2023. In this paper, a low-complexity neural network consisting of 633.5 K trainable parameters was proposed, requiring 43.2 M MACs to process one second of audio. This approach uses 7.8 % fewer trainable parameters and 40 % fewer MACs compared to the naive application of two independent models. The developed model can be used in smart devices owing to its small number of trainable parameters and the small number of MACs required for its application.
- Research Article
- 10.7494/cmms.2006.2.0104
- Jan 1, 2006
- Computer Methods in Materials Science
The main objective of the work is to evaluate the effectiveness of dynamic neural networks in modelling the copper flash smelting process. The fundamentals of dynamic neural networks are presented in the paper. This type of neural network was tested on a theoretical problem with time lag. Next, dynamic neural networks were applied to the prediction of chosen output parameters of the copper flash smelting process. The copper flash smelting process is very complex, and there are many input and output parameters which should be considered in modelling and control of the process. Some of the output process parameters depend on the history of changes of the input parameters. Moreover, some parameters can react to changes of input parameters with a delay, but the values of the delays are unknown. This situation causes many problems in modelling this metallurgical process. The work presents a comparison of the results obtained by dynamic and static neural networks in predicting the temperature of exhaust gases. The obtained results confirm that the dynamic neural network model can predict output parameters of the copper flash smelting process with high accuracy. Moreover, dynamic neural networks make it possible to identify the delays in the reaction of the output process parameters to changes of the input parameters. The obtained results have shown that dynamic neural networks are a very useful tool in modelling complex metallurgical processes.
- Book Chapter
- 10.1007/978-3-540-87559-8_60
- Sep 3, 2008
Recurrent neural networks, unlike feed-forward networks, are able to process inputs with time context. The key role in this process is played by the dynamics of the network, which transforms input data to the recurrent layer states. Several authors have described and analyzed the dynamics of small recurrent neural networks with two or three hidden units. In our work we introduce techniques that make it possible to visualize and analyze the dynamics of large recurrent neural networks with dozens of units and to reveal both stable and unstable points (attractors and saddle points), which are important for understanding the principles of successful task processing. As a practical example of this approach, the dynamics of a simple recurrent network trained by two different training algorithms on the context-free language a^n b^n was studied.
- Research Article
3
- 10.1016/j.asoc.2024.112444
- Nov 14, 2024
- Applied Soft Computing
A sparse diverse-branch large kernel convolutional neural network for human activity recognition using wearables
- Research Article
10
- 10.1007/s11042-018-7142-7
- Jan 9, 2019
- Multimedia Tools and Applications
A smart environment is one of the application scenarios of the Internet of Things (IoT). In order to provide a ubiquitous smart environment for humans, a variety of technologies have been developed. In a smart environment system, sound event detection is one of the fundamental technologies, which can automatically sense sound changes in the environment and detect the sound events that cause them. In this paper, we propose the use of a Relational Recurrent Neural Network (RRNN) for polyphonic sound event detection, called RRNN-SED, which utilizes the strength of the RRNN in long-term temporal context extraction and relational reasoning across a polyphonic sound signal. Different from previous sound event detection methods, which rely heavily on convolutional neural networks or recurrent neural networks, the proposed RRNN-SED method can solve the problems of long-lasting and overlapping events in polyphonic sound event detection. Specifically, since the historical information memorized inside RRNNs is capable of interacting across a polyphonic sound signal, the proposed RRNN-SED method is effective and efficient in extracting temporal context information and reasoning about the unique relational characteristics of the target sound events. Experimental results on two public datasets show that the proposed method achieves better sound event detection results in terms of segment-based F-score and segment-based error rate.
- Research Article
8
- 10.1371/journal.pone.0208370
- Jan 30, 2019
- PLoS ONE
As interdisciplinary branches of ecology are developing rapidly in the 21st century, the content of ecological research has become more abundant than ever before. Along with the exponential growth in the number of published papers, it is more and more difficult for ecologists to get a clear picture of their discipline. Nevertheless, the era of big data has brought us massive amounts of well-documented historical literature and various data processing techniques, which greatly facilitate bibliometric analysis of ecology. Frequency has long been used as the primary metric in keyword analysis to detect ecological hotspots; however, this method can be somewhat biased. In our study, we suggest a method called PAFit to measure keyword popularity, which considers ecology-related topics in a large temporal dynamical knowledge network, and we find that the popularity of ecological topics follows the “rich get richer” and “fit get richer” mechanisms. The feasibility of network analysis and its superiority over simply using frequency are explored and justified, and PAFit is validated by its outstanding performance in predicting the growth of frequency and degree. In addition, our research also encourages ecologists to consider their domain knowledge in a large dynamical network, and to be ready to participate in interdisciplinary collaborations when necessary.
- Conference Article
- 10.1117/12.56904
- Mar 1, 1992
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
An established way to synthesize associative memory networks is to use dynamical neural networks. For large dimensional problems, the dynamical networks usually are computationally burdensome to design and generally introduce spurious memories. A new architecture that consists of an input linear filter, a hidden layer of dynamical network and an output linear filter is proposed in this paper to alleviate some of the difficulties in designing large dimensional dynamical networks. A learning rule and its simplified version are presented for the design of the network parameters.
- Conference Article
12
- 10.23919/eusipco47968.2020.9287372
- Jan 24, 2021
This paper proposes a sound event localization and detection (SELD) method using a convolutional recurrent neural network (CRNN) with gated linear units (GLUs). The proposed method employs GLUs with the convolutional neural network (CNN) layers of the CRNN to extract adequate spectral features from amplitude and phase spectra. When the CNNs extract features of high-dimensional dependencies of frequency bins, the GLUs weight the extracted features based on the importance of the bins, like an attention mechanism. Features extracted from bins where sounds are absent, which are not informative and degrade the SELD performance, are weighted to 0 and ignored by the GLUs. Only the features extracted from informative bins are used for the CNN output for better SELD performance. The obtained CNN outputs are fed to consecutive bi-directional gated recurrent units (GRUs), which capture temporal information. Finally, the GRU output is shared by two task-specific layers, which are sound event detection (SED) layers and direction of arrival (DoA) estimation layers, to obtain SELD results. Evaluation results using the TAU Spatial Sound Events 2019 - Ambisonic dataset show the effectiveness of GLUs in the proposed method, which improves SELD performance by up to 0.10 in F1-score, 0.15 in error rate, and 16.4° in DoA estimation error compared to a CRNN baseline method.
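The gating behaviour described in this abstract follows the standard gated linear unit, which multiplies each feature by the sigmoid of a paired gate value. The sketch below shows only that element-wise operation in plain Python. The feature and gate values are made up for illustration, and in a real CRNN both would come from convolutional layers rather than hand-written lists.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def glu(features, gates):
    """Gated linear unit: each feature is scaled by sigmoid(gate)."""
    return [f * sigmoid(g) for f, g in zip(features, gates)]

# Illustrative values only: three frequency bins.
features = [2.0, 3.0, -1.0]
gates = [10.0, -10.0, 0.0]   # strongly open, strongly closed, neutral

gated = glu(features, gates)
# The first bin passes almost unchanged, the second is suppressed
# to nearly zero, and the third is halved.
```

A strongly negative gate drives the output toward zero, which matches the abstract's description of uninformative bins being weighted to 0 and ignored.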
- Research Article
- 10.4236/ijis.2013.34016
- Jan 1, 2013
- International Journal of Intelligence Science
Brain-like computer research and development have been growing rapidly in recent years. It is necessary to design large-scale dynamical neural networks (more than 10^6 neurons) to simulate the complex processes of the brain. However, such a task is not easy to achieve based only on the analysis of partial differential equations, especially for complex neural models such as the Rose-Hindmarsh (RH) model. In this paper, we therefore develop a novel approach that combines fuzzy logic design with Proximal Support Vector Machine (PSVM) classifier learning in the design of large-scale neural networks. In particular, our approach can effectively simplify the design process, which is crucial for both cognitive science and neuroscience. Finally, we apply our approach to an artificial neural system with more than 10^8 neurons for a haze-removal task, and the experimental results show that texture features extracted by fuzzy logic can effectively increase the texture information entropy and improve the haze-removal effect to some degree.
- Conference Article
10
- 10.1109/iscslp49672.2021.9362116
- Jan 24, 2021
In this paper, we propose a model ensemble approach for sound event localization and detection (SELD). We adopt several deep neural network (DNN) architectures to perform sound event detection (SED) and direction-of-arrival (DOA) estimation simultaneously. Generally, the DNN architecture consists of three modules stacked together, i.e., a High-level Feature Representation module, a Temporal Context Representation module, and a Fully-connected module at the end. The High-level Feature Representation module usually contains a series of convolutional neural network (CNN) layers to extract useful local features. The Temporal Context Representation module aims to model longer temporal context dependency in the extracted features. There are two parallel branches in the Fully-connected module, one for SED estimation and the other for DOA estimation. With different combinations of implementations of the High-level Feature Representation module and the Temporal Context Representation module, several network architectures are used for the SELD task. Finally, a more robust prediction of SED and DOA is obtained by model ensemble and post-processing. Tested on the development and evaluation datasets, the proposed approach achieves promising results and ranks first in the DCASE 2020 Task 3 challenge. Index Terms: sound event localization and detection, deep neural network, model ensemble