Abstract

Technological advancement, in addition to the pandemic, has given rise to an explosive increase in the consumption and creation of multimedia content worldwide. This has motivated people to enrich and publish their content in a way that enhances the user experience. In this paper, we propose a context-based structure mining pipeline that not only attempts to enrich the content, but also simultaneously splits it into shots and logical story units (LSU). The paper then extends the structure mining pipeline to re-ID objects in broadcast videos such as SOAPs. We hypothesise that the object re-ID problem for SOAP-type content is equivalent to identifying reoccurring contexts, since these contexts normally exhibit a unique spatio-temporal similarity within the content structure. Using pre-trained models for object and place detection, the pipeline was evaluated with shot- and scene-detection metrics on benchmark datasets such as RAI. The object re-ID methodology was also evaluated on 20 randomly selected episodes of the broadcast SOAP shows New Girl and Friends. We demonstrate, quantitatively, that the pipeline outperforms existing state-of-the-art methods for shot boundary detection, scene detection, and re-identification tasks.

Highlights

  • Due to advances in storage and digital media technology, videos have become the main source of visual information

  • We propose a novel multi-object re-ID algorithm based on context similarity in SOAP and broadcast content to generate object timelines

  • We propose an algorithm that formulates unique object IDs using logical story units (LSU) and frame-level object detections, such that re-occurring objects are assigned the same ID (see the sketch following this list)
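
The full formulation of the ID-assignment algorithm appears in the paper itself; the Python sketch below is only an illustration of the idea, assuming a toy LSU representation, a Jaccard-style `context_similarity` measure over object and place tags, and a 0.6 similarity threshold, none of which are taken from the paper.

```python
# Illustrative sketch only (not the authors' implementation): propagate object
# IDs across logical story units (LSUs) so that re-occurring objects in
# similar contexts keep the same ID. The LSU fields, the Jaccard-based
# context_similarity measure, and the 0.6 threshold are assumptions.

from dataclasses import dataclass

@dataclass
class LSU:
    lsu_id: int
    object_classes: set   # object classes detected across the LSU's shots
    place_tags: set       # place labels detected across the LSU's shots

def context_similarity(a: LSU, b: LSU) -> float:
    """Jaccard overlap of object and place tags as a stand-in context measure."""
    objects = len(a.object_classes & b.object_classes) / max(1, len(a.object_classes | b.object_classes))
    places = len(a.place_tags & b.place_tags) / max(1, len(a.place_tags | b.place_tags))
    return 0.5 * (objects + places)

def assign_object_ids(lsus, threshold=0.6):
    """Return, per LSU, a {class: object ID} map; re-occurring objects reuse IDs."""
    next_id = 0
    id_map = {}        # (lsu_id, class) -> assigned object ID
    seen = []          # LSUs processed so far
    assignments = []
    for lsu in lsus:
        # Most similar previously seen context, if any
        best = max(seen, key=lambda s: context_similarity(lsu, s), default=None)
        reuse = best is not None and context_similarity(lsu, best) >= threshold
        ids = {}
        for cls in lsu.object_classes:
            if reuse and (best.lsu_id, cls) in id_map:
                ids[cls] = id_map[(best.lsu_id, cls)]   # re-occurring context: reuse ID
            else:
                ids[cls] = next_id                      # new context or class: fresh ID
                next_id += 1
            id_map[(lsu.lsu_id, cls)] = ids[cls]
        seen.append(lsu)
        assignments.append(ids)
    return assignments

# Toy usage: the third LSU repeats the first context, so its objects reuse the
# IDs assigned the first time that context appeared.
episode = [LSU(0, {"couch", "mug"}, {"living_room"}),
           LSU(1, {"car"}, {"street"}),
           LSU(2, {"couch", "mug"}, {"living_room"})]
print(assign_object_ids(episode))
```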


Summary

Introduction

Due to advances in storage and digital media technology, videos have become the main source of visual information. In addition, numerous broadcast channels produce enormous amounts of video content that is shot and stored every second. With such large collections of videos, it is very difficult to locate the appropriate video files and extract information from them effectively. The temporal nature of the content, and the lack of proper indexing methods that leverage non-textual features, make it difficult to catalogue and retrieve videos efficiently [1]. To address these challenges, efforts are being made in every direction to bridge the gap between low-level binary video representations and high-level text-based video descriptions (e.g., video categories, types or genre) [2,3,4,5,6,7]. The proposed architecture extracts semantic tags such as objects, actions and locations from the videos, using them to obtain scene/shot boundaries and to re-ID objects within the video, as sketched below.
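
This excerpt does not specify which pre-trained models the pipeline uses; as a minimal sketch, assuming a recent torchvision (>= 0.13) and its Faster R-CNN detector as a stand-in object model, frame-level object tags could be extracted as follows, with place tags obtained analogously from a scene classifier (e.g., one trained on Places365).

```python
# Hedged illustration: frame-level object-tag extraction with a pre-trained
# detector. torchvision's Faster R-CNN is used as a stand-in for the paper's
# (unspecified here) object model; requires torchvision >= 0.13.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def frame_object_tags(frame_rgb, score_threshold=0.7):
    """Return the set of detected class indices for one RGB frame (H x W x 3 uint8)."""
    with torch.no_grad():
        prediction = detector([to_tensor(frame_rgb)])[0]
    keep = prediction["scores"] >= score_threshold
    # Numeric COCO class indices; map to names with the dataset's label list as needed.
    return {int(label) for label in prediction["labels"][keep]}
```

Tags collected this way per frame can then be aggregated over shots and LSUs to drive the boundary-detection and re-ID stages described above.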
