Abstract

Machine comprehension of visual information from images and videos by neural networks suffers from two limitations: (1) the computational and inference gap between vision and language in accurately determining which object a given agent acts upon and then representing it in language, and (2) the limited stability and generalization of classifiers trained as a single, monolithic neural network. To address these limitations, we propose MoE-VRD, a novel approach to visual relationship detection via a mixture of experts. MoE-VRD identifies language triplets in the form of ⟨subject, predicate, object⟩ tuples, extracting the relationship between subject, predicate, and object from visual processing. Since detecting a relationship between a subject (acting) and the object(s) (being acted upon) requires that the action be recognized, we base our network on recent work in visual relationship detection. To address the limitations of single monolithic networks, our mixture of experts is built from multiple small models whose outputs are aggregated. That is, each expert in MoE-VRD is a visual relationship learner capable of detecting and tagging objects. MoE-VRD employs an ensemble of networks while preserving the complexity and computational cost of the underlying visual relationship model by applying a sparsely gated mixture of experts, which enables conditional computation and a significant gain in neural network capacity. We show that the conditional computation and scalability of the mixture of experts lead to an approach to visual relationship detection that outperforms the state of the art.
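
The sparse gating mentioned above can be illustrated with a minimal sketch of top-k mixture-of-experts routing in the style of Shazeer et al. (2017), where only a few experts are evaluated per input. The layer sizes, expert architecture, and names below are assumptions for illustration and do not reproduce the authors' MoE-VRD implementation.

```python
# Minimal sketch of sparsely gated mixture-of-experts routing (illustrative only;
# expert architecture and dimensions are assumptions, not the MoE-VRD design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, in_dim=512, out_dim=128, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small, independent relationship learner (here a tiny MLP).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(in_dim, num_experts)  # gating network scores each expert
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, in_dim) visual features for a candidate subject-object pair.
        logits = self.gate(x)                                 # (batch, num_experts)
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)  # keep only k experts per sample
        weights = F.softmax(topk_val, dim=-1)                 # renormalize over the selected k
        out = x.new_zeros(x.size(0), self.experts[0][-1].out_features)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                 # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Usage: route a batch of pair features through only the top-2 experts per sample.
moe = SparseMoE()
features = torch.randn(4, 512)
relation_embedding = moe(features)  # shape (4, 128)
```

Because each input activates only k of the experts, capacity grows with the number of experts while the per-sample computation stays close to that of a single small model, which is the conditional-computation benefit the abstract refers to.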
