Abstract

In this paper, we introduce the DEEP-HEAR framework, a multimodal dynamic subtitle positioning system designed to improve the accessibility of multimedia documents for deaf and hearing-impaired people (HIP). The proposed system exploits both computer vision algorithms and deep convolutional neural networks specifically designed and tuned to detect the active speaker and recognize their identity. The main contributions of the paper are: (1) a novel method for recognizing the various characters present in the video stream; (2) a video temporal segmentation algorithm that divides the video sequence into semantic units based on face tracks and visual consistency; and (3) the core of our approach, a novel active speaker recognition method relying on multimodal information fusion from the text, audio, and video streams. Experimental results carried out on a large-scale dataset of more than 30 videos validate the proposed methodology, with average accuracy and recognition rates above 90%. Moreover, the method is robust to significant object/camera motion and face pose variation, yielding gains of more than 8% in precision and recall over state-of-the-art techniques. A subjective evaluation of the proposed dynamic subtitle positioning system further demonstrates the effectiveness of our approach.
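To make the multimodal fusion idea concrete, the sketch below shows one common way such a fusion could be realized: a weighted late fusion of per-modality speaker scores. This is purely illustrative; the data structures, weights, and function names are assumptions, not the authors' implementation.

```python
# Hypothetical late-fusion sketch of active speaker recognition.
# Nothing here comes from the DEEP-HEAR source code; the modality
# scores and fusion weights are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class ModalityScores:
    face: dict[str, float]   # identity -> score from the face recognition module
    audio: dict[str, float]  # identity -> score from the audio/speaker analysis
    text: dict[str, float]   # identity -> score from subtitle/script alignment


def fuse_speaker_scores(scores: ModalityScores,
                        weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> str:
    """Return the identity with the highest weighted sum of modality scores."""
    w_face, w_audio, w_text = weights
    identities = set(scores.face) | set(scores.audio) | set(scores.text)
    fused = {
        ident: w_face * scores.face.get(ident, 0.0)
               + w_audio * scores.audio.get(ident, 0.0)
               + w_text * scores.text.get(ident, 0.0)
        for ident in identities
    }
    return max(fused, key=fused.get)


# Example: per-modality scores for two candidate speakers in one shot.
shot_scores = ModalityScores(
    face={"alice": 0.8, "bob": 0.4},
    audio={"alice": 0.6, "bob": 0.7},
    text={"alice": 0.9},
)
print(fuse_speaker_scores(shot_scores))  # -> "alice"
```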

Highlights

  • The recent statistics published by the World Health Organization [1] show that for people aged over 50 years, hearing impairments become progressively more common worldwide

  • To facilitate access to information and meet the needs of people with hearing disabilities, most TV broadcasters transmit and distribute, together with the audio and video signals, textual information presented in the form of video subtitles or closed captions

  • In order to evaluate the influence of each component of our system on the speaker recognition performance, we considered for comparison: (1) an active speaker recognition strategy based solely on the face recognition module


Summary

INTRODUCTION

The recent statistics published by the World Health Organization [1] show that for people aged over 50 years, hearing impairments become progressively more common worldwide. In contrast with existing systems, where closed captions are always placed at a fixed position at the bottom of the screen, our approach helps hearing-impaired users match the script with the corresponding character by positioning the subtitles so that the active speaker can be identified. The DEEP-HEAR framework (Fig. 1) jointly exploits computer vision algorithms and deep convolutional neural networks (CNNs) to carry out the various stages required for this purpose, including face detection, tracking and recognition, video temporal segmentation, active speaker detection and recognition, background text detection, and subtitle positioning.
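As a minimal, self-contained illustration of the dynamic subtitle positioning stage, the sketch below places a subtitle near the active speaker's face while avoiding background text regions. It assumes the speaker's face box and any detected text boxes are already available from the earlier stages; the placement heuristic itself is a simplified stand-in, not the paper's algorithm.

```python
# Simplified subtitle placement sketch (assumption: face and background-text
# boxes are given). Boxes are (x, y, width, height) in pixels.
Box = tuple[int, int, int, int]


def overlaps(a: Box, b: Box) -> bool:
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_subtitle(frame_size: tuple[int, int],
                   speaker_face: Box,
                   subtitle_size: tuple[int, int],
                   text_regions: list[Box]) -> Box:
    """Place the subtitle below (or above) the speaker's face, avoiding
    the face itself, background text regions, and the frame borders."""
    frame_w, frame_h = frame_size
    sub_w, sub_h = subtitle_size
    fx, fy, fw, fh = speaker_face

    # Candidates: below the face, above the face, bottom-centre fallback.
    raw_candidates = [
        (fx + fw // 2 - sub_w // 2, fy + fh + 10),
        (fx + fw // 2 - sub_w // 2, fy - sub_h - 10),
        (frame_w // 2 - sub_w // 2, frame_h - sub_h - 20),
    ]
    clamped = []
    for cx, cy in raw_candidates:
        cx = min(max(cx, 0), frame_w - sub_w)
        cy = min(max(cy, 0), frame_h - sub_h)
        clamped.append((cx, cy, sub_w, sub_h))

    for box in clamped:
        if not any(overlaps(box, region) for region in text_regions + [speaker_face]):
            return box
    return clamped[-1]  # fallback: bottom-centre position


# Example: 1280x720 frame, face near the top-left, one background text band.
print(place_subtitle((1280, 720), (100, 80, 120, 150), (400, 60),
                     [(0, 600, 1280, 60)]))
```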

RELATED WORK
ACTIVE SPEAKER DETECTION
SUBTITLE POSITIONING
EXPERIMENTAL EVALUATION
OBJECTIVE EVALUATION
Findings
CONCLUSION AND PERSPECTIVES
