Abstract

Understanding video files is a challenging task. While current video understanding techniques rely on deep learning, the results they produce lack genuinely trustworthy meaning. Deep learning recognizes patterns in big data, which yields deep feature abstraction, not deep understanding. Deep learning tries to understand a multimedia production by analyzing its content alone, yet the semantics of a multimedia file cannot be understood from its content only: events occurring in a scene earn their meanings from the context that contains them. A screaming kid could be scared of a threat, surprised by a lovely gift, or just playing in the backyard. Artificial intelligence is a heterogeneous process that goes beyond learning. In this article, we discuss the heterogeneity of AI as a process that includes innate knowledge, approximation, and context awareness. We present a context-aware video understanding technique that makes the machine intelligent enough to understand the message behind the video stream. The main purpose is to understand the video stream by extracting real, meaningful concepts, emotions, temporal data, and spatial data from the video context. The diffusion of heterogeneous data patterns from the video context leads to accurate decision-making about the video message and outperforms systems that rely on deep learning alone. Objective and subjective comparisons confirm the accuracy of the concepts extracted by the proposed context-aware technique relative to current deep learning video understanding techniques. Both systems are compared in terms of retrieval time, computing time, data size consumption, and complexity. The comparisons show significantly more efficient resource usage by the proposed context-aware system, which makes it a suitable solution for real-time scenarios. Finally, we discuss the pros and cons of deep learning architectures.

Highlights

  • Current smartphones come with great hardware and software capabilities. These devices have given their owners the ability to become active online publishers

  • We describe artificial general intelligence (AGI) as a heterogeneous process that includes learning and innate knowledge, approximation, and context-awareness

  • In 2015, [15] introduced long-term recurrent convolutional networks (LRCNs), where the outputs of a 2D Convolutional Neural Network (CNN) are fed into a stack of Long Short-Term Memory (LSTM) networks; a minimal sketch of this pattern follows this list
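
To make the LRCN pattern concrete, here is a minimal PyTorch sketch of the idea: per-frame features from a 2D CNN are fed into stacked LSTMs. The class name, layer sizes, and toy CNN are illustrative assumptions, not the architecture of [15] or of this paper.

```python
# Hypothetical LRCN-style sketch: a 2D CNN extracts per-frame features,
# which a stack of LSTMs models over time. All names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class LRCNSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        # A small 2D CNN stands in for the per-frame feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
        )
        # Two stacked LSTM layers capture temporal structure across frames.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)      # hidden state per frame
        return self.head(out[:, -1])   # classify from the last time step

# Usage: a batch of two 16-frame RGB clips at 64x64 resolution.
logits = LRCNSketch()(torch.randn(2, 16, 3, 64, 64))
```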


Introduction

Current smartphones come with great hardware and software capabilities. These devices have given their owners the ability to become active online publishers. That is the path toward real artificial general intelligence that could exist in our daily life and bring real value. To make this possible, we need to mind the semantic gap [2] between the low-level features that represent the audio, visual, and textual content of the video and the high-level concepts as perceived by human cognition. A supporter will celebrate a goal while watching the match (visual signal), listening to the commentator (audio and sound signals), and reading comments (textual data). To make such a video file available and reachable to the concerned target audience, a human-like cognition architecture is needed to process all the signals of the video file, correlate them to the surrounding context, and recognize the different actions within the scene; a minimal sketch of such context-aware fusion follows this paragraph. Deep learning has achieved noticeable progress in pattern recognition [4, 5], beating humans at games [6], neuroradiology [7], healthcare [8], FEA design and misfit minimization [9], travel decision frameworks [10], data-driven Earth system science [11], and the analysis of graph signals [12].
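
The following is a minimal, hypothetical Python sketch of the idea that the same low-level evidence takes different meanings under different contexts: per-modality concept scores are fused and then reweighted by a context prior. The function name, scores, and weights are illustrative assumptions, not the paper's actual system.

```python
# Hypothetical context-aware late fusion: visual, audio, and textual
# analyzers each score candidate concepts; a context prior reweights
# the fused scores before a concept is selected. All values are
# illustrative assumptions.
def fuse_with_context(modal_scores, context_prior):
    """modal_scores: {modality: {concept: score}}; context_prior: {concept: weight}."""
    fused = {}
    for scores in modal_scores.values():
        for concept, score in scores.items():
            fused[concept] = fused.get(concept, 0.0) + score
    # Context disambiguates: identical evidence (e.g., a screaming kid)
    # maps to different concepts under different priors.
    return max(fused, key=lambda c: fused[c] * context_prior.get(c, 1.0))

scores = {
    "visual": {"fear": 0.4, "joy": 0.5},
    "audio":  {"fear": 0.6, "joy": 0.5},
}
# A "birthday party" context prior favors "joy" over "fear".
print(fuse_with_context(scores, {"joy": 1.5, "fear": 0.5}))  # -> "joy"
```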

Video Representation
Background
A Proposed Context-Aware System for Video Understanding
Experimental Benchmark
G Major Manager
Evaluation Metrics
Conclusions and Future Work
