Abstract
This study adopts the Branchformer architecture for automatic speech recognition in place of the widely used Conformer, offering a new approach to audio-visual speech recognition (AVSR). Building on Branchformer, we propose the Relational-Branchformer (R-Branchformer). A convolutional attention relation module strengthens the connection between the local and global branches by modeling their interrelations and interactions, allowing local and global contextual information to be embedded into each other and substantially improving model performance. The model is trained with the connectionist temporal classification (CTC) loss, with intermediate CTC losses inserted between blocks. In addition, a gated interlayer collaboration module, adapted and enhanced from prior work, replaces the intermediate CTC module and relaxes the conditional independence assumption inherent to CTC, further improving performance. We also propose an audio-visual output enhancement module that fuses information from the audio and visual modalities to enrich the audio-visual representation. R-Branchformer achieves word error rates of 1.7% and 1.5% on the LRS2 and LRS3 test sets, respectively, demonstrating state-of-the-art performance on audio-visual speech recognition.
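To make the two-branch design concrete, the sketch below shows a minimal Branchformer-style block with a global self-attention branch, a local depthwise-convolution branch, and an illustrative cross-branch "relation" step that lets each branch condition on the other before merging. This is an assumption-laden sketch, not the authors' implementation: the class name RelationalBranchformerBlock, the cross-attention-based relation step, and the concatenate-and-project merge are all placeholders standing in for the paper's convolutional attention relation module.

```python
import torch
import torch.nn as nn


class RelationalBranchformerBlock(nn.Module):
    """Illustrative two-branch block: a global attention branch, a local
    convolutional branch, and a hypothetical relation step that embeds each
    branch's context into the other before merging (all names and choices
    here are assumptions, not the paper's exact module)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.norm_global = nn.LayerNorm(d_model)
        self.norm_local = nn.LayerNorm(d_model)
        # Global branch: multi-head self-attention over the whole sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Local branch: depthwise convolution over time.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # Hypothetical relation step: cross-attention between the branches.
        self.rel_g2l = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rel_l2g = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        g = self.norm_global(x)
        g, _ = self.attn(g, g, g)                      # global context
        l = self.norm_local(x).transpose(1, 2)         # (batch, d_model, time)
        l = self.conv(l).transpose(1, 2)               # local context
        # Embed each branch's context into the other (illustrative fusion).
        g_rel, _ = self.rel_g2l(g, l, l)               # global queries local
        l_rel, _ = self.rel_l2g(l, g, g)               # local queries global
        merged = self.merge(torch.cat([g + g_rel, l + l_rel], dim=-1))
        return x + merged


# Usage sketch: a batch of 8 utterances, 100 frames, 256-dim features.
block = RelationalBranchformerBlock()
y = block(torch.randn(8, 100, 256))
print(y.shape)  # torch.Size([8, 100, 256])
```

In a full AVSR model of the kind described above, stacks of such blocks would encode the audio and visual streams, with intermediate CTC (or a gated interlayer collaboration step) applied between blocks and the final fused representation decoded with CTC.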