Traditional English corpora mainly collect information from a single modality and lack multimodal information, which lowers the quality of the corpus and limits recognition accuracy. To address these problems, this paper introduces depth information into multimodal corpora and studies both a construction method for English multimodal corpora that integrate electronic images and depth information and a speech recognition method for such a corpus. The multimodal fusion strategy combines speech signals with image information, including key visual cues such as the speaker’s lip movements and facial expressions, and uses deep learning to extract acoustic and visual features. The experiments use the acoustic model in the Kaldi toolkit. The experiments led to the following conclusions. With 15-dimensional lip features at a signal-to-noise ratio (SNR) of 10 dB, the accuracy of corpus A was 2.4% higher than that of corpus B under the monophone model and 1.7% higher under the triphone model. With 32-dimensional lip features at an SNR of 10 dB, the speech recognition performance of corpus A was 1.4% higher than that of corpus B under the monophone model, and its accuracy was 2.6% higher under the triphone model. The English multimodal corpus with image and depth information achieves high accuracy, and the depth information helps to improve the recognition accuracy obtained with the corpus.
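To illustrate what combining acoustic and lip features can look like at the frame level, the following is a minimal sketch of feature-level (early) fusion. It is an assumption for illustration only, not the fusion method of this paper: the abstract does not specify the fusion level, the frame rates, or the acoustic feature type. The function names `align_visual_to_audio` and `fuse_features`, the 13-dimensional acoustic vectors, and the 100 and 25 frames-per-second rates are all hypothetical; only the 15-dimensional lip feature size comes from the text.

```python
import numpy as np


def align_visual_to_audio(visual, n_audio_frames):
    """Linearly interpolate visual features of shape (T_v, D_v) to n_audio_frames rows."""
    t_v, d_v = visual.shape
    src = np.linspace(0.0, 1.0, t_v)             # normalized time of each video frame
    dst = np.linspace(0.0, 1.0, n_audio_frames)  # normalized time of each audio frame
    columns = [np.interp(dst, src, visual[:, d]) for d in range(d_v)]
    return np.stack(columns, axis=1)


def fuse_features(acoustic, visual):
    """Concatenate per-frame acoustic and (upsampled) visual features.

    Assumption: simple early fusion by concatenation; the paper may use a different scheme.
    """
    visual_up = align_visual_to_audio(visual, acoustic.shape[0])
    return np.concatenate([acoustic, visual_up], axis=1)


# Hypothetical one-second utterance with random placeholder features.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(100, 13))  # assumed: 100 audio frames/s, 13 coefficients
lip = rng.normal(size=(25, 15))        # assumed: 25 video frames/s; 15-dim lip features
fused = fuse_features(acoustic, lip)
print(fused.shape)  # (100, 28)
```

Concatenation keeps one fused vector per audio frame, so the result could in principle be fed to a frame-based acoustic model; whether the study does this, or fuses the modalities at a later stage, is not stated in the abstract.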