Abstract

When making a video, if the video has a language organization error, it needs to be re-recorded. It is not possible to remove inappropriate or unnatural pronunciation parts of the recording more effectively. In response to this problem, this paper studies the speech extraction, error correction and synthesis of video, which is divided into three parts: (1) Speech segmentation and speech-to-text of video; (2) Text recognition error correction; (3) Text-to-speech and video speech synthesis. For the first part, we applied the staged and efficient algorithm based on (Bayesian Information Criterion) BIC & (Statistical Mean Euclidean Distance) MEdist to segment the video voice, and then, the segmented audio is subtracted to reduce noise, and finally converted to text using the iFLYTEK interface. For the second part, we apply the (Double Automatic Error Correction) DAEC algorithm to text error correction. For the third part, we use the (Improved Chinese Realtime Voice Cloning) I-Zhrtvc for text-to-speech. Then merge the voice into the video. The simulation result shows that the staged and efficient algorithm based on BIC & MEdist, which accurately segmented by sentences, can identify audio with dialect accents, and has high accuracy in translating to text, up to an average of 95.8%. DAEC algorithm has a high error correction rate. The audio prosody accuracy after synthesis is high. ZVTOW text-to-speech (Mean Opinion Score) MOS up to 4.5.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call