Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition.

Minsu Kim, Hyung-Il Kim, Yong Man Ro

doi:10.1109/tpami.2024.3484658

Abstract

Visual Speech Recognition (VSR) aims to transcribe speech into text from lip movements alone. Because it relies on visual information to model speech, its performance is inherently sensitive to individual lip appearance and movement, so VSR models degrade when applied to unseen speakers. In this paper, to remedy this performance degradation on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Unlike previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore different types of prompts (addition, padding, and concatenation forms) that can be applied to a VSR model, which is generally composed of a CNN and a Transformer. With the proposed prompt tuning, we show that the performance of a pre-trained VSR model on unseen speakers can be largely improved using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model was already developed with large speaker variations. Moreover, by analyzing the performance and parameters of the different types of prompts, we investigate when prompt tuning is preferred over finetuning methods. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.
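
The three prompt forms named in the abstract can be illustrated with a minimal sketch. The abstract does not say where each prompt is attached or what shape it has, so everything below is an assumption made for illustration: the function names, the use of nested lists in place of tensors, and the choice to show the addition prompt on an input frame, the padding prompt as a learnable border around a feature map, and the concatenation prompt as extra tokens in front of a feature sequence. In each case only the prompt values would be trained on the target speaker's adaptation data, while the pre-trained model stays fixed.

```python
from typing import List

Matrix = List[List[float]]  # hypothetical stand-in for a 2-D tensor


def addition_prompt(frame: Matrix, prompt: Matrix) -> Matrix:
    """Addition form: add a prompt of the same shape element-wise."""
    return [[x + p for x, p in zip(row, prow)]
            for row, prow in zip(frame, prompt)]


def padding_prompt(feature: Matrix, border: Matrix) -> Matrix:
    """Padding form: surround an H x W map with prompt values.

    `border` is assumed to be (H + 2) x (W + 2); its outer ring is kept
    and its interior is overwritten with `feature`.
    """
    out = [row[:] for row in border]  # copy so the prompt is not mutated
    for i, row in enumerate(feature):
        for j, value in enumerate(row):
            out[i + 1][j + 1] = value
    return out


def concatenation_prompt(tokens: Matrix, prompt_tokens: Matrix) -> Matrix:
    """Concatenation form: prepend prompt tokens to a token sequence."""
    return prompt_tokens + tokens


# Toy usage with made-up values.
frame = [[1.0, 2.0], [3.0, 4.0]]
added = addition_prompt(frame, [[0.5, 0.5], [0.5, 0.5]])
padded = padding_prompt(frame, [[9.0] * 4 for _ in range(4)])
sequence = concatenation_prompt([[1.0, 1.0], [2.0, 2.0]], [[0.0, 0.0]])
print(len(added), len(padded), len(sequence))
```

The sketch is deliberately framework-free; in a real model the prompts would be trainable parameters and the three operations would run on tensors inside the network's forward pass.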

More From: IEEE Transactions on Pattern Analysis and Machine Intelligence

Similar Papers

AI-based visual speech recognition towards realistic avatars and lip-reading applications in the metaverse
Ying Li ... Ali Ahmadian
Applied Soft Computing | Vol. 164
28 Jun 2024

Speaker-Adaptive Lip Reading with User-Dependent Padding
Minsu Kim ... Hyunjun Kim
01 Jan 2021

Visual Lip-Reading for Quranic Arabic Alphabets and Words Using Deep Learning
Nada Faisal Aljohani ... Emad Sami Jaha
Computer Systems Science and Engineering | Vol. 46
01 Jan 2023

Lip Detection and Lip Geometric Feature Extraction using Constrained Local Model for Spoken Language Identification using Visual Speech Recognition
Aparna Brahme ... Umesh Bhadade
Indian Journal of Science and Technology | Vol. 9
30 Aug 2016
