Abstract

Visual speech recognition technology aims to improve recognition in noisy mobile contexts by extracting lip movement from input face images. Conventional bimodal speech recognition algorithms depend on frontal face (lip) images, but because users must speak while holding the device in front of their face, these techniques are difficult to use. Our proposed method records lip movement using a small camera built into a smartphone, making it more practical, simple, and natural. This technique also effectively prevents a decline in the input signal-to-noise ratio (SNR). For CNN-based recognition, visual features are extracted via optical-flow analysis and merged with audio data. Whereas the previous model does not produce output in audio format, our proposed model can also provide the output as audio. In the proposed system, the user either speaks into the camera or uploads a video. The system first detects the lip region in the video and divides the lip video into multiple frames. After sequencing the lip frames, features are extracted from them; a model is trained on these features, and the trained model is used to predict the sequence of phoneme distributions. The final output is the word or phrase spoken by the user, displayed on the screen.
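
To make the processing steps described above concrete, the sketch below outlines the front end of such a pipeline: cropping the lip region from each video frame, computing dense optical flow between consecutive lip frames, and stacking the flow fields as input features for a CNN. This is a minimal illustration using OpenCV, not the paper's implementation; the Haar-cascade face detection, the lower-third lip crop, the frame size, and the input file name `speaker.mp4` are assumptions made for illustration, and the CNN and phoneme decoder are omitted.

```python
# Sketch of the visual front end: lip-region cropping, then dense optical
# flow between consecutive lip frames, stacked as CNN input features.
# Assumptions (not from the paper): Haar face detection with the lower third
# of the face box taken as the lip region; input video name is hypothetical.
import cv2
import numpy as np

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_frames(video_path, size=(64, 64)):
    """Return a list of grayscale lip-region crops, one per video frame."""
    cap = cv2.VideoCapture(video_path)
    lips = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_CASCADE.detectMultiScale(gray, 1.3, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]
        # Crude lip localisation: lower third of the detected face box.
        lip = gray[y + 2 * h // 3 : y + h, x : x + w]
        lips.append(cv2.resize(lip, size))
    cap.release()
    return lips

def optical_flow_features(lip_frames):
    """Stack Farneback dense optical flow between consecutive lip frames.

    Assumes at least two lip frames were extracted.
    """
    flows = []
    for prev, nxt in zip(lip_frames, lip_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)  # shape (H, W, 2): per-pixel x/y displacement
    return np.stack(flows)

if __name__ == "__main__":
    frames = extract_lip_frames("speaker.mp4")   # hypothetical input video
    features = optical_flow_features(frames)
    print("flow feature tensor:", features.shape)  # (T-1, 64, 64, 2), fed to the CNN
```

The resulting flow tensor would then be passed, together with audio features, to the trained CNN that predicts the sequence of phoneme distributions.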
