DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification

Tony Alex, Sara Ahmed, Muhammad Awais, Armin Mustafa, Philip J. B. Jackson

doi:10.1609/aaai.v38i16.29716

Abstract

Convolutional neural networks (CNNs) and Transformer-based networks have recently received significant attention for audio classification and tagging tasks, following their wide adoption in the computer vision domain. Although information is distributed differently in audio spectrograms than in natural images, there has been limited exploration of effective information retrieval from spectrograms using layers tailored to the audio domain. In this paper, we leverage the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer), which facilitates interactions across time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available at https://github.com/ta012/DTFAT.git.
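The abstract does not spell out the architecture, so the following is background only, not the DTF-AT implementation. MaxViT, the backbone named above, alternates local "block" attention with sparse global "grid" attention. The minimal sketch below shows only how the two schemes group tokens, using a toy 4 x 8 (frequency x time) index map as a stand-in for a spectrogram feature map. The function names and the toy sizes are hypothetical, and the decoupled time-frequency components of DTF-AT are not shown.

```python
def block_partition(grid, p):
    """Split a 2D map into non-overlapping p x p windows of neighbouring cells.

    Attention restricted to one window is local (MaxViT-style block attention).
    """
    h, w = len(grid), len(grid[0])
    assert h % p == 0 and w % p == 0, "map size must be divisible by p"
    windows = []
    for i0 in range(0, h, p):
        for j0 in range(0, w, p):
            windows.append(
                [grid[i][j] for i in range(i0, i0 + p) for j in range(j0, j0 + p)]
            )
    return windows


def grid_partition(grid, g):
    """Split a 2D map into groups of g x g cells sampled with a uniform stride.

    Each group spans the whole map, so attention restricted to one group is
    global but sparse (MaxViT-style grid attention).
    """
    h, w = len(grid), len(grid[0])
    assert h % g == 0 and w % g == 0, "map size must be divisible by g"
    stride_h, stride_w = h // g, w // g
    groups = []
    for i0 in range(stride_h):
        for j0 in range(stride_w):
            groups.append(
                [grid[i][j] for i in range(i0, h, stride_h) for j in range(j0, w, stride_w)]
            )
    return groups


# Toy feature map: each cell is labelled with its (frequency, time) index.
spec = [[(f, t) for t in range(8)] for f in range(4)]

local = block_partition(spec, 2)   # 8 windows of 4 adjacent cells
sparse = grid_partition(spec, 2)   # 8 groups of 4 widely spaced cells

print(local[0])   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sparse[0])  # [(0, 0), (0, 4), (2, 0), (2, 4)]
```

In the first window the cells are neighbours in both frequency and time, while the first grid group samples cells half a map apart along each axis. Both partitions cover every cell exactly once.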


Published in: Proceedings of the AAAI Conference on Artificial Intelligence
