Abstract
In the multimedia era, video data is produced at a massive scale owing to advances in computing and network infrastructure. Summarizing the events in a video ordinarily requires watching it in full, which can take hours, and manually writing a summary demands considerable human time and effort. In this work, a six-layer Convolutional Neural Network (CNN) is proposed to characterize the images fed into an encoder-decoder framework. The framework summarizes video events by taking images as input and generating a natural-language summary rather than machine-like text, producing a sentence-level synopsis for each sequence of images. Quantitative and qualitative analyses show that the model outperforms the existing state-of-the-art model in terms of training loss, Metric for Evaluation of Translation with Explicit ORdering (METEOR) score, and BiLingual Evaluation Understudy (BLEU) score. Applications of the proposed model include visual storytelling and identifying critical events in an entire video.
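As a concrete illustration of the described pipeline, the sketch below wires a six-layer CNN frame encoder to an LSTM caption decoder in PyTorch. The channel widths, feature dimension, vocabulary size, and the show-and-tell-style injection of the image feature as the first decoder input are assumptions chosen for the example; the abstract does not specify the paper's exact configuration.

```python
# Minimal sketch: six-layer CNN encoder -> LSTM sentence decoder.
# All layer sizes below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class SixLayerCNN(nn.Module):
    """Six convolutional layers that encode one frame into a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (16, 32, 64, 64, 128, 128):   # assumed channel widths
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.fc = nn.Linear(128, feat_dim)

    def forward(self, x):                       # x: (B, 3, H, W)
        h = self.features(x).mean(dim=(2, 3))   # global average pool -> (B, 128)
        return self.fc(h)                       # (B, feat_dim)

class CaptionDecoder(nn.Module):
    """LSTM decoder that turns a frame feature into a word sequence."""
    def __init__(self, vocab_size=5000, feat_dim=256, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, feat, captions):
        # Prepend the image feature as the first "token" (show-and-tell style).
        inputs = torch.cat([feat.unsqueeze(1), self.embed(captions)], dim=1)
        h, _ = self.lstm(inputs)
        return self.out(h)                      # (B, T+1, vocab_size) logits

# Toy forward pass on a batch of 64x64 frames with dummy caption tokens.
enc, dec = SixLayerCNN(), CaptionDecoder()
frames = torch.randn(2, 3, 64, 64)
caps = torch.randint(0, 5000, (2, 10))
logits = dec(enc(frames), caps)
print(logits.shape)  # torch.Size([2, 11, 5000])
```

In such a setup, training would minimize cross-entropy between the predicted logits and the reference summary sentences, and generation quality would be scored with METEOR and BLEU as in the abstract.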