Distributed Audio-Visual Parsing Based On Multimodal Transformer and Deep Joint Source Channel Coding

Abstract

Audio-visual parsing (AVP) is a recently emerged multimodal perception task that detects and classifies audio-visual events in video. However, most existing AVP networks use only a simple attention mechanism to guide audio-visual multimodal events and are implemented on a single end. As a result, they cannot effectively capture the relationships between audio-visual events and are not suited to network transmission scenarios. In this paper, we address these problems and propose a distributed audio-visual parsing network (DAVPNet) based on a multimodal transformer and deep joint source channel coding (DJSCC). Multimodal transformers are used to enhance the attention calculation between audio-visual events, and DJSCC is used to apply DAVP tasks to network transmission scenarios. Finally, the Look, Listen, and Parse (LLP) dataset is used to test the algorithm's performance, and the experimental results show that DAVPNet has superior parsing performance.
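
The abstract names two building blocks without giving details: cross-modal attention between audio and visual features, and a learned encoding sent over a noisy channel. The sketch below is only an illustration of those two ideas, not the DAVPNet implementation. The feature sizes, the function names (cross_attention, power_normalize, awgn_channel), and the choice of an additive white Gaussian noise channel with unit-power normalization are all assumptions made for this example; the abstract does not specify any of them.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row (one modality)
    attends over the key/value rows of the other modality."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def power_normalize(z):
    """Scale the transmitted symbols to unit average power
    (a common constraint in joint source-channel coding; assumed here)."""
    power = sum(x * x for x in z) / len(z)
    return [x / math.sqrt(power) for x in z]


def awgn_channel(z, snr_db, rng):
    """Add white Gaussian noise for a unit-power signal at the given SNR (dB)."""
    noise_std = math.sqrt(10 ** (-snr_db / 10))
    return [x + rng.gauss(0.0, noise_std) for x in z]


# Hypothetical toy features: 3 time segments, 4 dimensions per modality.
rng = random.Random(0)
audio = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
visual = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

# Audio segments query the visual segments, then the fused features
# are flattened, power-normalized, and passed through the noisy channel.
fused = cross_attention(audio, visual, visual)
tx = power_normalize([x for row in fused for x in row])
rx = awgn_channel(tx, snr_db=10.0, rng=rng)
```

In a trained system the encoder and decoder around the channel would be learned networks; here the "encoder" is just flattening and normalization, so that the channel step can be seen in isolation.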

Similar Papers
  • Conference Article
  • Cite Count Icon 2
  • 10.1109/ijcnn55064.2022.9892495
Look longer to see better: Audio-visual event localization by exploiting long-term correlation
  • Jul 18, 2022
  • Longyin Guo + 2 more

Visual and auditory modalities both contain a large amount of rich information about audio-visual events. While the human perception system can effectively fuse the information of the dual modalities in recognizing events, it is still an open issue how to effectively integrate dual-modal information for the task of automatic localization of audio-visual events in videos. In this paper, we propose an audio-visual long-term correlation network to capture the longer correlation of audio and visual features, which is underused by existing methods. To this end, we first propose the time-spatial guided attention (TSGA) module, which locates the spatial region of the audio-visual events in the video and focuses on continuous changes in that location. We then propose the positive time residual fusion (PTRF) module, which encodes the temporal correlation matrix of video and audio, and uses residual fusion to combine audio and visual features. We finally evaluate our method for the fully supervised and weakly supervised tasks on the AVE dataset. The results prove the superiority of our method over its counterparts.

  • Research Article
  • Cite Count Icon 3
  • 10.1609/aaai.v39i7.32784
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yunlong Tang + 5 more

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, their temporal understanding capacity in video-language tasks can be obtained. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to localize audio-visual events in videos temporally. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.

  • Conference Article
  • Cite Count Icon 151
  • 10.1109/cvpr52688.2022.00493
End-to-End Referring Video Object Segmentation with Multimodal Transformers
  • Jun 1, 2022
  • Adam Botach + 2 more

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR.

  • Video Transcripts
  • 10.48448/fqar-3406
Multimodal Phased Transformer for Sentiment Analysis
  • Oct 15, 2021
  • Underline Science Inc.
  • Junyan Cheng + 1 more

Multimodal Transformers achieve superior performance in multimodal learning tasks. However, the quadratic complexity of the self-attention mechanism in Transformers limits their deployment on low-resource devices and makes their inference and training computationally expensive. We propose the multimodal Sparse Phased Transformer (SPT) to alleviate the problem of self-attention complexity and memory footprint. SPT uses a sampling function to generate a sparse attention matrix and compress a long sequence to a shorter sequence of hidden states. SPT concurrently captures interactions between the hidden states of different modalities at every layer. To further improve the efficiency of our method, we use Layer-wise parameter sharing and Factorized Co-Attention that share parameters between Cross Attention Blocks, with minimal impact on task performance. We evaluate our model with three sentiment analysis datasets and achieve comparable or superior performance compared with the existing methods, with a 90% reduction in the number of parameters. We conclude that SPT, along with parameter sharing, can capture multimodal interactions with reduced model size and improved sample efficiency.

  • Conference Article
  • Cite Count Icon 11
  • 10.1109/icme51207.2021.9428081
Multimodal Transformer Networks with Latent Interaction for Audio-Visual Event Localization
  • Jul 5, 2021
  • Yixuan He + 4 more

The task of audio-visual event localization (AVEL) aims to localize a visible and audible event in a video. Previous methods first divide a video into segments and then fuse visual and acoustic features at the segment level via a co-attention mechanism. However, existing methods mostly model relations between individual visual and audio segments over a short period, which may not cover a video duration long enough to model high-level event information well. In this paper, we propose a novel model termed Multimodal Transformer Network with Latent Interaction (MTNLI) to tackle this problem. The proposed MTNLI model employs a multimodal Transformer structure to learn the cross-modality relationships between latent visual and audio summarizations in long segment sequences, which summarize the visual and audio segments into a small number of latent representations to avoid modeling uninformative individual visual-audio relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities, which can effectively handle large temporal inconsistencies between vision and audio. Our MTNLI method achieves state-of-the-art performance on the benchmark AVE (Audio-Visual Event) dataset for the event localization task.

  • Research Article
  • Cite Count Icon 4
  • 10.3390/app122412622
Self-Supervised Video Representation and Temporally Adaptive Attention for Audio-Visual Event Localization
  • Dec 9, 2022
  • Applied Sciences
  • Yue Ran + 3 more

Localizing the audio-visual events in video requires a combined judgment of visual and audio components. To integrate multimodal information, existing methods modeled the cross-modal relationships by feeding unimodal features into attention modules. However, these unimodal features are encoded in separate spaces, resulting in a large heterogeneity gap between modalities. Existing attention modules, on the other hand, ignore the temporal asynchrony between vision and hearing when constructing cross-modal connections, which may lead to the misinterpretation of one modality by another. Therefore, this paper aims to improve event localization performance by addressing these two problems and proposes a framework that feeds audio and visual features encoded in the same semantic space into a temporally adaptive attention module. Specifically, we develop a self-supervised representation method to encode features with a smaller heterogeneity gap by matching corresponding semantic cues between synchronized audio and visual signals. Furthermore, we develop a temporally adaptive cross-modal attention based on a weighting method that dynamically channels attention according to the time differences between event-related features. The proposed framework achieves state-of-the-art performance on the public audio-visual event dataset and the experimental results not only show that our self-supervised method can learn more discriminative features but also verify the effectiveness of our strategy for assigning attention.

  • Research Article
  • 10.1109/tnnls.2025.3600878
Fine-Grained Audio-Visual Event Localization.
  • Jan 1, 2026
  • IEEE transactions on neural networks and learning systems
  • Baoyu Fan + 5 more

Audio-visual event localization (AVEL) aims to recognize events in videos by associating audio-visual information. However, events involved in existing AVEL tasks are usually coarse-grained. In practice, finer-grained events sometimes need to be distinguished, especially in certain expert-level applications or rich-content-generation studies. This is challenging because they are more difficult to detect or distinguish than coarse-grained events. To better address this problem, we discuss a new setting of fine-grained AVEL from dataset to method. First, we constructed the first fine-grained audio-visual event dataset, called IT-AVE, built from videos of musical instruments being played and containing 13k video clips and over 52k audio-visual events. All events are labeled by professional music practitioners, and the event categories are all derived from playing techniques, which are fine-grained with little interclass variation. Next, we designed a new fine-grained event localization method, spatial-temporal video event detector (SVED), which focuses on the challenges that fine-grained events are harder to perceive and more easily disturbed. Finally, we conduct extensive experiments based on the proposed IT-AVE dataset versus fine-grained versions of two existing related datasets, including UnAV-22 derived from UnAV-100 and FineAction-AV derived from FineAction. Experimental results demonstrate the effectiveness of our method. We hope that this work will contribute to the exploration of an integrated understanding of audio-visual videos.

  • Research Article
  • Cite Count Icon 4
  • 10.1177/20552076241305168
Assessing severity of pediatric pneumonia using multimodal transformers with multi-task learning
  • Jan 1, 2024
  • Digital Health
  • Jing Li + 10 more

Objective: While current multimodal approaches in the diagnosis and severity assessment of pneumonia demonstrate remarkable performance, they frequently overlook the issue of modality absence, a common challenge in clinical practice. Thus, we present the robust multimodal transformer (RMT) model, crafted to bridge this gap. The RMT model aims to enhance diagnosis and severity assessment accuracy in situations with incomplete data, thereby ensuring it meets the complex needs of real-world clinical settings.
Method: The RMT model leverages multimodal data, integrating X-ray images and clinical text data through a sophisticated AI-driven framework. It employs a Transformer-based architecture, enhanced by multi-task learning and mask attention mechanism. This approach aims to optimize the model’s performance across different modalities, particularly under conditions of modality absence.
Results: The RMT model demonstrates superior performance over traditional diagnostic methods and baseline models in accuracy, precision, sensitivity, and specificity. In tests involving various scenarios, including single-modal and multimodal tasks, the model shows remarkable robustness in handling incomplete data. Its effectiveness is further validated through extensive comparative analysis and ablation studies.
Conclusion: The RMT model represents a substantial advancement in pediatric pneumonia severity assessment. It successfully harnesses multimodal data and advanced AI techniques to improve assessment precision. While the RMT model sets a new precedent in AI applications in medical diagnostics, the development of a comprehensive pediatric pneumonia dataset marks a pivotal contribution, providing a robust foundation for future research.

  • Research Article
  • Cite Count Icon 802
  • 10.1109/tpami.2023.3275156
Multimodal Learning With Transformers: A Survey.
  • Oct 1, 2023
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Peng Xu + 2 more

Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and Big Data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented at multimodal data. The main contents of this survey include: (1) a background of multimodal learning, Transformer ecosystem, and the multimodal Big Data era, (2) a systematic review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective, (3) a review of multimodal Transformer applications, via two important paradigms, i.e., for multimodal pretraining and for specific multimodal tasks, (4) a summary of the common challenges and designs shared by the multimodal Transformer models and applications, and (5) a discussion of open problems and potential research directions for the community.

  • Research Article
  • Cite Count Icon 23
  • 10.3389/frai.2021.767971
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations.
  • Dec 3, 2021
  • Frontiers in Artificial Intelligence
  • Nikolai Ilinykh + 1 more

Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task’s effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode the knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding - the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether the insights from cognitive science echo the structure of representations that the current neural architecture learns. 
The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve on such problems as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. In general, we contribute to the explainable multi-modal natural language processing and currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.

  • Research Article
  • Cite Count Icon 152
  • 10.1109/tmi.2022.3180228
Multimodal Transformer for Accelerated MR Imaging.
  • Oct 1, 2023
  • IEEE Transactions on Medical Imaging
  • Chun-Mei Feng + 6 more

Accelerated multi-modal magnetic resonance (MR) imaging is a new and effective solution for fast MR imaging, providing superior performance in restoring the target modality from its undersampled counterpart with guidance from an auxiliary modality. However, existing works simply combine the auxiliary modality as prior information, lacking in-depth investigations on the potential mechanisms for fusing different modalities. Further, they usually rely on convolutional neural networks (CNNs), which are limited by their intrinsic locality in capturing long-distance dependencies. To this end, we propose a multi-modal transformer (MTrans), which is capable of transferring multi-scale features from the target modality to the auxiliary modality, for accelerated MR imaging. To capture deep multi-modal information, our MTrans utilizes an improved multi-head attention mechanism, named cross attention module, which absorbs features from the auxiliary modality that contribute to the target modality. Our framework provides three appealing benefits: (i) MTrans uses an improved transformer for multi-modal MR imaging, affording more global information compared with existing CNN-based methods. (ii) A new cross attention module is proposed to exploit the useful information in each modality at different scales. The small patch in the target modality aims to keep more fine details, while the large patch in the auxiliary modality aims to obtain high-level context features from the larger region and supplement the target modality effectively. (iii) We evaluate MTrans with various accelerated multi-modal MR imaging tasks, e.g., MR image reconstruction and super-resolution, where MTrans outperforms state-of-the-art methods on fastMRI and real-world clinical datasets.

  • Research Article
  • Cite Count Icon 21
  • 10.1016/j.neucom.2021.03.026
Accelerated masked transformer for dense video captioning
  • Mar 16, 2021
  • Neurocomputing
  • Zhou Yu + 1 more


  • Research Article
  • 10.56947/amcs.v28.551
Forecasting daily oil prices using a multi-modal transformer with sentiment-guided attention
  • Jun 3, 2025
  • Annals of Mathematics and Computer Science
  • Ikhlaas Gurrib + 2 more

This study presents a novel forecasting framework, the Multi-Modal Transformer with Sentiment-Guided Attention (MMT-SGA), designed to enhance daily oil price predictions. Recognizing the limitations of traditional linear and statistical methods in capturing oil price volatility, the proposed model integrates structured numerical data and unstructured textual sentiment analysis through advanced transformer architectures. The sentiment-guided attention mechanism dynamically adjusts predictions based on real-time sentiment volatility, significantly improving forecasting accuracy and responsiveness. Comprehensive numerical experiments conducted over a decade of data (2015–2024) demonstrate the model’s superior performance compared to established methods such as ARIMA, Random Forest, XGBoost, LSTM, and TCN. Results highlight MMT-SGA's robustness, interpretability, and adaptability in complex and volatile market environments, underscoring its potential for informed decision-making in economic and policy contexts.

  • Conference Article
  • Cite Count Icon 2
  • 10.1109/aike52691.2021.00022
Audio-Visual Event Localization based on Cross-Modal Interacting Guidance
  • Dec 1, 2021
  • Qiurui Yue + 2 more

This paper studies the audio-visual event localization task, which requires the machine to locate the start and end time of the visual and audio events in an unconstrained video at the same time and identify the event category. To address this task, we propose a cross-modal interacting guidance network. Unlike previous works, it can model the complex relationships within a modality through the audio and video interacting guidance mechanism. Specifically, our cross-modal interacting guidance network is mainly composed of the cross-modal relation-aware network used as the baseline and the audio-visual interacting guidance module that we add. The cross-modal interacting guidance module (CMIG) can dynamically adjust the intra-modal attention of the target modality based on the attention flow of another modality, which is very important for modeling the complex relationships within the modality. Experiments show that our framework achieves state-of-the-art performance in both fully supervised and weakly supervised settings on the Audio-Visual Event Location (AVE) dataset.

  • Conference Article
  • Cite Count Icon 32
  • 10.1109/isscc42615.2023.10067842
16.1 MulTCIM: A 28nm $2.24 \mu\mathrm{J}$/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers
  • Feb 19, 2023
  • Fengbin Tu + 7 more

Human perception is multimodal and able to comprehend a mixture of vision, natural language, speech, etc. Multimodal Transformer (MulT, Fig. 16.1.1) models introduce a cross-modal attention mechanism to vanilla transformers to learn from different modalities, achieving excellent results on multimodal AI tasks like video question answering and multilingual image retrieval. Transformers require specialized hardware for efficient inference [1]. Prior work demonstrates that a Compute-In-Memory (CIM) accelerator with attention sparsity can efficiently process vanilla transformers [2]. Multimodal signals like video and audio exhibit diverse token significance, providing new opportunities for token sparsity via runtime pruning [3]. Additionally, activation functions like GELU and softmax produce many near-zero values that expose bit sparsity in the most-significant bits (MSB). In utilizing attention-token-bit hybrid sparsity, there are three challenges: 1) For attention sparsity, irregular patterns result in long reuse distance, which requires CIM to hold infrequently used weights, lowering CIM utilization. 2) Although token sparsity reduces computation, MulT's cross-modal attention processes tokens from two modalities with different token lengths (N) and embedding dimensionality $(d_m)$, causing high latency in cross-modal switch. 3) At the bit level, since token sparsity reduces value locality, a CIM macro has more variance in effective bitwidth for the same group of inputs. In a conventional CIM's bit-serial MAC scheme, computation time is defined by the longest bitwidth.
