Abstract
For the scene graph generation task, a multimodal fusion method based on semantic descriptions is proposed to address the long-tail distribution of datasets and the low frequency of high-level semantic interactions. First, object detection and relationship inference are performed on the image to construct an image scene graph. Second, the semantic descriptions are fed into a pre-trained scene graph parser to construct a semantic scene graph. Finally, the two scene graphs are explicitly aligned, and the information of nodes and edges is updated to obtain a fused scene graph with more comprehensive coverage and more accurate semantic interaction information.
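The fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the triple-based graph format, exact label matching for node alignment, and the rule of letting the semantic graph's predicate override the visual one are all assumptions made for the example.

```python
def fuse_scene_graphs(image_graph, semantic_graph):
    """Fuse two scene graphs given as sets of (subject, predicate, object) triples.

    Nodes are aligned by exact label match. When both graphs relate the same
    (subject, object) pair, the semantic graph's predicate is kept, on the
    assumption that it carries the higher-level interaction information.
    """
    fused = {}
    for s, p, o in image_graph:
        fused[(s, o)] = p
    for s, p, o in semantic_graph:
        fused[(s, o)] = p  # semantic relation overrides the visual one
    return {(s, p, o) for (s, o), p in fused.items()}


# Toy example: the detector sees spatial proximity ("near"), while the
# semantic description supplies the interaction ("riding").
image_sg = {("man", "near", "horse"), ("man", "has", "hat")}
semantic_sg = {("man", "riding", "horse")}

print(sorted(fuse_scene_graphs(image_sg, semantic_sg)))
# → [('man', 'has', 'hat'), ('man', 'riding', 'horse')]
```

In practice, node alignment would use visual-textual grounding or embedding similarity rather than exact string match, and edge updates would combine rather than overwrite relation evidence.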