Cross-modal learning representation using new margin combination for speech recognition task

D Karim, M Abdelkarim

doi:10.22201/icat.24486736e.2024.22.3.2417

Abstract

Cross-modal retrieval aims to elucidate how information from different modalities is fused and to mimic human learning. The main challenge in cross-modal matching is to build a shared subspace that reflects semantic proximity. Previous works adopt symmetric similarity computations and therefore fail to capture asymmetric relevance. To overcome this shortcoming, an efficient approach called quaternion representation learning (QRL) is introduced for better cross-modal matching. The richer representation capacity and strong expressive power of the quaternionic space yield a better representation of the shared semantics. Transfer learning is also crucial in this context: by leveraging pre-trained models, the knowledge gained from one task or domain can be transferred to another, improving performance and generalization. In this study, transfer learning is employed to enhance the cross-modal retrieval system. Specifically, a pre-trained ResNet-512 model is used in conjunction with the proposed total margin (TM) loss function, which combines the QRL approach with the novel adaptive mean margin (AMM) methodology. The TM loss function, coupled with the pre-trained ResNet-512 model, is evaluated on the Audio-Visual Arabic Speech Database (AVAS) and the Arabic Visual Speech Database (AVSD), along with other audio-visual datasets. Experimental results show that the TM loss function consistently improves performance on both databases. The recall scores (R@k) and mean average precision (mAP) achieved on the AVAS database are R@1: 42.1±0.7, R@2: 70.2±0.1, R@5: 78.5±1.0, and mAP: 53.0±1.1. Similarly, on the AVSD database, the results are R@1: 41.7±0.3, R@2: 69.2±1.1, R@5: 78.0±0.3, and mAP: 52.7±0.5.
By incorporating transfer learning and the TM loss function into the cross-modal retrieval framework, this study demonstrates the potential for improving clustering efficiency and enhancing visual and speech understanding. The combination of pre-trained models and the TM loss function offers a promising avenue for advancing cross-modal matching techniques and achieving state-of-the-art performance.
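
The abstract reports R@k and mAP without stating how they are computed. The following is a minimal sketch of the commonly used definitions for label-based cross-modal retrieval, not the authors' evaluation code. The function names, the similarity matrix, and the toy labels are all hypothetical and chosen only for illustration. It assumes that a gallery item counts as relevant when it shares the query's class label.

```python
import numpy as np

def recall_at_k(similarity, query_labels, gallery_labels, k):
    """Fraction of queries with at least one relevant item in the top k."""
    # Rank gallery items for each query by descending similarity.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = gallery_labels[top_k] == query_labels[:, None]
    return float(hits.any(axis=1).mean())

def mean_average_precision(similarity, query_labels, gallery_labels):
    """Mean over queries of the average precision of the ranked gallery."""
    order = np.argsort(-similarity, axis=1)
    relevant = gallery_labels[order] == query_labels[:, None]
    ap_values = []
    for row in relevant:
        n_relevant = row.sum()
        if n_relevant == 0:
            continue  # assumption: queries without relevant items are skipped
        ranks = np.arange(1, row.size + 1)
        precision_at_rank = np.cumsum(row) / ranks
        # Average the precision values at the ranks of the relevant items.
        ap_values.append(float((precision_at_rank * row).sum() / n_relevant))
    return float(np.mean(ap_values))

# Hypothetical toy data: 3 queries (e.g. audio) against 3 gallery items (e.g. video).
similarity = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
])
query_labels = np.array([0, 1, 2])
gallery_labels = np.array([0, 1, 2])

r_at_1 = recall_at_k(similarity, query_labels, gallery_labels, k=1)
r_at_2 = recall_at_k(similarity, query_labels, gallery_labels, k=2)
m_ap = mean_average_precision(similarity, query_labels, gallery_labels)
```

In this toy case only the first query ranks its match first, while the other two rank it second, so R@1 is lower than R@2. Reported values such as those in the abstract would come from averaging scores like these over repeated runs, which is presumably the source of the ± terms.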

Journal: Journal of Applied Research and Technology
Publication Date: Jun 28, 2024
License type: CC BY-NC-ND 4.0
