Speech recognition using an English multimodal corpus with integrated image and depth information

Bing Wang

doi:10.1038/s41598-024-78557-2

Open Access

Journal: Scientific Reports
Publication Date: Nov 6, 2024
License type: CC BY-NC-ND 4.0

Abstract

Traditional English corpora mainly collect information from a single modality and lack multimodal information, which lowers the quality of the corpus and limits recognition accuracy. To address this, the paper introduces depth information into multimodal corpora and studies both a construction method for an English multimodal corpus that integrates electronic images and depth information and a speech recognition method for that corpus. The multimodal fusion strategy combines speech signals with image information, including key visual cues such as the speaker’s lip movements and facial expressions, and uses deep learning to mine acoustic and visual features. The experiments use the acoustic models in the Kaldi toolkit. The following conclusions were drawn. With 15-dimensional lip features at a signal-to-noise ratio (SNR) of 10 dB, the accuracy of corpus A was 2.4% higher than that of corpus B under the monophone model and 1.7% higher under the triphone model. With 32-dimensional lip features at an SNR of 10 dB, corpus A outperformed corpus B by 1.4% under the monophone model, and its accuracy was 2.6% higher under the triphone model. The English multimodal corpus with image and depth information achieves high accuracy, and the depth information helps to improve the accuracy of the corpus.
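
The abstract reports results at an SNR of 10 dB and describes fusing acoustic features with lip features. The sketch below only illustrates those two ideas and is not the paper's pipeline: it scales a noise signal so that the speech-to-noise power ratio matches a target SNR, and it joins per-frame feature vectors by simple concatenation. The function names (`mix_at_snr`, `snr_db`, `fuse_frame`), the concatenation-based fusion, and the 13-dimensional acoustic vector in the usage note are assumptions made for this example; only the 15-dimensional lip feature size comes from the abstract.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Return (noisy_signal, scaled_noise) with the requested SNR.

    Hypothetical helper: solves
        target = 10 * log10(P_speech / (g**2 * P_noise))
    for the gain g, then adds g * noise to the speech.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("both signals must have non-zero power")
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled_noise)]
    return noisy, scaled_noise


def fuse_frame(acoustic, lip):
    """Feature-level fusion by concatenation (an assumed, simplest-case scheme)."""
    return list(acoustic) + list(lip)
```

For example, calling `mix_at_snr(speech, noise, 10.0)` yields a noisy signal whose speech-to-noise ratio is 10 dB, and `fuse_frame` applied to an assumed 13-dimensional acoustic vector and a 15-dimensional lip vector gives a 28-dimensional fused frame.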
