Abstract

With the ever-increasing internet penetration across the world, there has been a huge surge in content on the worldwide web, and video has proven to be one of the most popular media. The COVID-19 pandemic has further accelerated this trend, forcing learners to turn to e-learning platforms. In the absence of relevant descriptions of these videos, it becomes imperative to generate metadata from the content of the videos themselves. In this paper, an attempt is made to index videos based on their visual and audio content. The visual content is extracted by applying Optical Character Recognition (OCR) to the stack of frames obtained from a video, while the audio content is transcribed using Automatic Speech Recognition (ASR). The OCR- and ASR-generated texts are combined to obtain the final description of the respective video. The dataset contains 400 videos spread across 4 genres. To quantify the accuracy of the descriptions, K-means clustering is performed on the video descriptions to discern between the genres.

Keywords: Optical Character Recognition, Automatic Speech Recognition, Video analytics, Natural language processing, K-means clustering
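The abstract outlines a three-stage pipeline: OCR over sampled video frames, ASR over the audio track, and K-means clustering of the combined text descriptions. A minimal sketch of such a pipeline is given below, assuming OpenCV and pytesseract for frame-level OCR, OpenAI's Whisper as a stand-in ASR engine (the abstract does not name the engine used), and scikit-learn's K-means over TF-IDF vectors (the vectorisation scheme is likewise an assumption). All function names, the frame-sampling rate, and the `video_paths` corpus variable are illustrative, not the authors' implementation.

```python
import cv2
import pytesseract
import whisper
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def ocr_frames(video_path, every_n_frames=30):
    """Run OCR on every n-th frame of the video and join the results.

    The sampling interval is an assumed parameter; the paper's frame
    selection strategy is not specified in the abstract.
    """
    cap = cv2.VideoCapture(video_path)
    texts, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray))
        i += 1
    cap.release()
    return " ".join(texts)


def describe(video_path, asr_model):
    """Combine OCR text and an ASR transcript into one description.

    Whisper reads the audio track directly from the video file via
    ffmpeg; it stands in for whatever ASR system the paper used.
    """
    asr_text = asr_model.transcribe(video_path)["text"]
    return ocr_frames(video_path) + " " + asr_text


if __name__ == "__main__":
    # video_paths is a hypothetical list of the 400 video files.
    video_paths = ["lecture_001.mp4", "lecture_002.mp4"]
    model = whisper.load_model("base")
    descriptions = [describe(p, model) for p in video_paths]

    # Cluster the descriptions into the 4 genres with K-means on
    # TF-IDF vectors, mirroring the evaluation described above.
    X = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(labels)
```

In this setup, clustering quality could be quantified by comparing the predicted cluster labels against the known genre labels, e.g. with scikit-learn's adjusted Rand index.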
