Abstract

Voice loss is a serious disorder that is strongly associated with social isolation. Multimodal information sources, such as audiovisual data, are valuable because they can support straightforward, personalized word prediction models capable of reproducing the patient's original voice. In this work we designed a multimodal approach based on audiovisual information recorded from patients before loss of voice to develop a system for automated lip-reading in the Greek language. Data pre-processing methods, such as lip segmentation and frame-level sampling, were used to enhance the quality of the imaging data. Audio information was incorporated into the model to automatically annotate sets of frames as words. Recurrent neural networks were trained on four different video recordings to develop a robust word prediction model. The model correctly identified test words in different time frames with 95% accuracy. To our knowledge, this is the first word prediction model trained to recognize words from video recordings in the Greek language.

Highlights

  • Our intention is to develop a high-performance deep learning (DL) model that captures the patient's lips from video recordings to provide word predictions in the Greek language

  • The prediction performance of the proposed recurrent neural network (RNN) was favorable, yielding 84% accuracy in the last 500 epochs and 85% in the last 100 epochs during training

  • The proposed word prediction model adopts a recurrent neural network (RNN) architecture, which belongs to the family of sequence-to-sequence models



Introduction

The human voice is a fundamental characteristic of human communication and expression. Our intention is to develop a high-performance DL model that captures the patient's lips from video recordings to provide word predictions in the Greek language. The frames are automatically annotated based on the provided annotation file, which includes the time intervals for each word. The prediction performance of the proposed RNN was favorable, yielding 84% accuracy over the last 500 epochs and 85% over the last 100 epochs of training (1,500 epochs were used in total). To our knowledge, this is the first DL-empowered model that can capture Greek words from extracted lips with increased performance.
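The automatic annotation step described above can be sketched as follows. This is a minimal illustration, assuming the annotation file provides (word, start, end) time intervals in seconds and that the video has a fixed frame rate; the function name, data format, and frame rate are illustrative assumptions, not details taken from the paper.

```python
def annotate_frames(intervals, n_frames, fps):
    """Label each extracted frame with the word whose time interval covers it.

    intervals: list of (word, start_s, end_s) tuples from the annotation file
    n_frames:  total number of frames extracted from the recording
    fps:       video frame rate (frames per second)
    Returns a list of per-frame labels; None marks frames outside any word.
    """
    labels = [None] * n_frames
    for word, start, end in intervals:
        first = int(start * fps)                      # first frame inside the interval
        last = min(n_frames - 1, int(end * fps))      # clamp to the clip length
        for i in range(first, last + 1):
            labels[i] = word
    return labels

# Example: two Greek words in a 2-second clip recorded at 25 fps
labels = annotate_frames([("καλημέρα", 0.2, 0.8), ("ναι", 1.2, 1.6)], 50, 25)
```

Sets of consecutive frames sharing a label then form the word-level training samples for the RNN.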

Existing Work
Data Sharing
The Proposed Architecture
Data Preprocessing
(1) Extraction of frames from the video recordings
Model Development
Word Prediction Model Architecture
Results
Discussion and Future
