Abstract

Visually impaired individuals face many difficulties in their daily lives. In this study, a video captioning system has been developed for visually impaired individuals that analyzes events through real-time images and expresses them in meaningful sentences. The study also aims to better understand the problems that visually impaired individuals experience in their daily lives. For this reason, the opinions and suggestions of the individuals within the Altınokta Blind Association (a Turkish organization of blind people) have been collected in order to produce more realistic solutions to their problems. The MSVD dataset, which consists of 1,970 YouTube clips, has been used as the training dataset. First, all clips have been muted so that the audio of the clips is not used in the sentence generation process. CNN and LSTM architectures have been used to generate sentences, and the experimental results have been compared using the BLEU-4, ROUGE-L, CIDEr, and METEOR metrics.
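
The abstract describes a CNN encoder feeding an LSTM decoder that generates caption words from video frames. The snippet below is a minimal sketch of that general architecture, assuming a PyTorch implementation with a ResNet-18 backbone, a mean-pooled video feature, and illustrative layer and vocabulary sizes; these specifics are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FrameEncoder(nn.Module):
    """Encode each sampled video frame with a CNN into a feature vector."""
    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = models.resnet18(weights=None)  # assumed backbone (torchvision >= 0.13)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the FC head
        self.proj = nn.Linear(resnet.fc.in_features, feat_dim)

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1)).flatten(1)   # (B*T, 512)
        return self.proj(x).view(b, t, -1)                   # (B, T, feat_dim)

class CaptionDecoder(nn.Module):
    """LSTM decoder that turns the pooled video feature into a word sequence."""
    def __init__(self, vocab_size=10000, feat_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, frame_feats, captions):         # captions: (B, L) word indices
        video_feat = frame_feats.mean(dim=1, keepdim=True)   # (B, 1, feat_dim)
        inputs = torch.cat([video_feat, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                               # per-step word logits
```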

Highlights

  • In order to facilitate the lives of visually impaired individuals, new technologies have been developed in the last decade

  • An automatic video description model should be able to express the objects and events presented in the video

  • The BLEU algorithm compares the sequence of words generated by the model with the reference descriptions of the video and assigns a score based on the number of n-gram matches (see the sketch after this list)
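
As a toy illustration of the highlight above, the snippet below scores a made-up candidate caption against two made-up reference captions with BLEU-4, using NLTK's sentence_bleu for brevity; the example sentences and the smoothing choice are illustrative assumptions, not data from the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference captions for one video and one generated candidate.
references = [
    "a man is playing a guitar".split(),
    "someone plays the guitar".split(),
]
candidate = "a man plays a guitar".split()

# BLEU-4 weights 1- to 4-gram precisions equally; smoothing avoids zero scores
# on short sentences where higher-order n-grams have no matches.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```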


Summary

INTRODUCTION

In order to facilitate the lives of visually impaired individuals, new technologies have been developed in the last decade. In sequence-to-sequence video captioning models, successive video frames are taken as input and successive words are produced as output. To describe the video in more detail, a mechanism called temporal attention has been applied together with CNN and LSTM architectures; this mechanism is used to better identify the important events in the video. Wang et al. have presented a new model that combines audio and visual cues, called the HACA (Hierarchically Aligned Cross-modal Attentive) network [12]. The performance of the developed model has been evaluated on the Microsoft Video Description Corpus (MSVD) using the BLEU, ROUGE, CIDEr, and METEOR metrics.
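
The temporal attention idea mentioned above can be sketched as follows: at each decoding step, the decoder state scores every frame feature, and a weighted sum of the frames conditions the next word. The module below is an illustrative PyTorch sketch with assumed dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Additive attention over frame features, conditioned on the decoder state."""
    def __init__(self, feat_dim=512, hid_dim=512, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hid_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, dec_hidden):
        # frame_feats: (B, T, feat_dim), dec_hidden: (B, hid_dim)
        energy = torch.tanh(self.w_feat(frame_feats)
                            + self.w_hid(dec_hidden).unsqueeze(1))      # (B, T, attn_dim)
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (B, T)
        context = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)      # (B, feat_dim)
        return context, weights  # context feeds the LSTM step; weights show frame importance
```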

VIDEO CAPTIONING
DEEP LEARNING
DEVELOPED MODEL
EXPERIMENTAL RESULTS
CONCLUSIONS