Video Visual Relation Detection With Contextual Knowledge Embedding

Qianwen Cao, Heyan Huang

doi:10.1109/tkde.2023.3270328

Abstract

Video visual relation detection (VidVRD) aims to abstract structured relations in the form of <subject-predicate-object> from videos. The triple formulation makes the search space extremely large and the distribution imbalanced. Existing works usually predict relationships from visual, spatial, and semantic cues. Among these, semantic cues capture the semantic connections between objects, which are crucial for transferring knowledge across relations. However, most of these works extract semantic cues by simply mapping object labels to classified features, which ignores the contextual surroundings and results in poor performance on low-frequency relations. To alleviate these issues, we propose a novel network, termed Contextual Knowledge Embedded Relation Network (CKERN), which facilitates VidVRD by establishing contextual knowledge embeddings for the detected object pairs in relations from two aspects: commonsense attributes and prior linguistic dependencies. Specifically, we take each pair as a query to extract relational facts from a commonsense knowledge base, then encode them to explicitly construct the semantic surroundings of relations. In addition, the statistics of object pairs with different predicates, distilled from large-scale visual relations, are taken into account to represent the linguistic regularity of relations. Extensive experimental results on benchmark datasets demonstrate the effectiveness and robustness of the proposed model.
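
As a rough illustration of the second aspect (statistics of object pairs with different predicates), the sketch below estimates an empirical distribution over predicates for a given subject-object category pair. It is not the authors' implementation: the function name `predicate_prior`, the add-alpha smoothing, and the toy triples are all assumptions made for this example, and the abstract does not say how CKERN encodes these statistics.

```python
from collections import Counter, defaultdict


def predicate_prior(triples, predicates, alpha=1.0):
    """Return a function estimating P(predicate | subject, object).

    Hypothetical helper: counts how often each predicate co-occurs with a
    (subject, object) category pair and applies add-alpha smoothing, so that
    rare or unseen pairs still receive a non-zero probability.
    """
    counts = defaultdict(Counter)
    for subj, pred, obj in triples:
        counts[(subj, obj)][pred] += 1

    def prior(subj, obj):
        # .get avoids creating an entry for pairs that were never observed.
        pair_counts = counts.get((subj, obj), Counter())
        total = sum(pair_counts.values()) + alpha * len(predicates)
        return {p: (pair_counts[p] + alpha) / total for p in predicates}

    return prior


# Toy, made-up relation annotations (assumption, not data from the paper).
triples = [
    ("person", "ride", "bicycle"),
    ("person", "ride", "bicycle"),
    ("person", "next_to", "bicycle"),
    ("dog", "chase", "ball"),
]
predicates = ["ride", "next_to", "chase"]

prior = predicate_prior(triples, predicates)
dist = prior("person", "bicycle")   # frequent pair: mass concentrates on "ride"
unseen = prior("cat", "sofa")       # unseen pair: falls back to a uniform prior
```

A vector like `dist` could be supplied to a relation classifier as an additional cue next to visual and spatial features. How the paper combines it with the commonsense-knowledge embedding is not stated in the abstract.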

Published in: IEEE Transactions on Knowledge and Data Engineering