SGSG: Stroke-Guided Scene Graph Generation.

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

3D scene graph generation is essential for spatial computing in Extended Reality (XR), providing structured semantics for task planning and intelligent perception. However, unlike instance-segmentation-driven setups, generating semantic scene graphs still suffers from limited accuracy due to the coarse and noisy point cloud data typically acquired in practice, and from the lack of interactive strategies for incorporating users' spatialized, intuitive guidance. We identify three key challenges: designing controllable interaction forms, involving guidance in inference, and generalizing from local corrections. To address these, we propose SGSG, a Stroke-Guided Scene Graph generation method that enables users to interactively refine 3D semantic relationships and improve predictions in real time. We propose three types of strokes and a lightweight SGstrokes dataset tailored for this modality. Our model integrates stroke guidance representation and injection for spatio-temporal feature learning and reasoning correction, along with intervention losses that combine consistency-repulsive and geometry-sensitive constraints to enhance accuracy and generalization. Experiments and a user study show that SGSG outperforms the state-of-the-art methods 3DSSG and SGFN in overall accuracy and precision, surpasses JointSSG in predicate-level metrics, and reduces task load across all control conditions, establishing SGSG as a new benchmark for interactive 3D scene graph generation and semantic understanding in XR. Implementation resources are available at: https://github.com/Sycamore-Ma/SGSG-runtime.
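
As a rough illustration of the kind of interactive correction the abstract describes — not SGSG's actual API, and the labels here are invented — a scene graph with user-overridable predicates might look like:

```python
# Hypothetical sketch: a scene graph as labeled nodes plus directed,
# predicate-labeled edges, with one edge overwritten by a user correction
# (e.g. after an interactive stroke selects that relationship).

class SceneGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> object label
        self.edges = {}   # (subject_id, object_id) -> predicate label

    def add_object(self, node_id, label):
        self.nodes[node_id] = label

    def add_relation(self, subj_id, obj_id, predicate):
        self.edges[(subj_id, obj_id)] = predicate

    def correct_relation(self, subj_id, obj_id, predicate):
        """Overwrite a predicted predicate with a user-supplied one."""
        self.edges[(subj_id, obj_id)] = predicate

g = SceneGraph()
g.add_object(0, "chair")
g.add_object(1, "table")
g.add_relation(0, 1, "attached to")            # noisy model prediction
g.correct_relation(0, 1, "standing next to")   # interactive user fix
```

The interesting part of the paper is everything this sketch omits: how a stroke selects the edge, and how the correction generalizes beyond the single edited relationship.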

Similar Papers
  • Research Article
  • Cited by 4
  • 10.3233/sw-233510
NeuSyRE: Neuro-symbolic visual understanding and reasoning framework based on scene graph enrichment
  • Oct 4, 2024
  • Semantic Web
  • M Jaleed Khan + 2 more

Exploring the potential of neuro-symbolic hybrid approaches offers promising avenues for seamless high-level understanding and reasoning about visual scenes. Scene Graph Generation (SGG) is a symbolic image representation approach based on deep neural networks (DNN) that involves predicting objects, their attributes, and pairwise visual relationships in images to create scene graphs, which are utilized in downstream visual reasoning. The crowdsourced training datasets used in SGG are highly imbalanced, which results in biased SGG results. The vast number of possible triplets makes it challenging to collect sufficient training samples for every visual concept or relationship. To address these challenges, we propose augmenting the typical data-driven SGG approach with common sense knowledge to enhance the expressiveness and autonomy of visual understanding and reasoning. We present a loosely-coupled neuro-symbolic visual understanding and reasoning framework that employs a DNN-based pipeline for object detection and multi-modal pairwise relationship prediction for scene graph generation and leverages common sense knowledge in heterogeneous knowledge graphs to enrich scene graphs for improved downstream reasoning. A comprehensive evaluation is performed on multiple standard datasets, including Visual Genome and Microsoft COCO, in which the proposed approach outperformed the state-of-the-art SGG methods in terms of relationship recall scores, i.e. Recall@K and mean Recall@K, as well as the state-of-the-art scene graph-based image captioning methods in terms of SPICE and CIDEr scores with comparable BLEU, ROUGE and METEOR scores. As a result of enrichment, the qualitative results showed improved expressiveness of scene graphs, resulting in more intuitive and meaningful caption generation using scene graphs. Our results validate the effectiveness of enriching scene graphs with common sense knowledge using heterogeneous knowledge graphs.
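
Relationship recall, the metric family reported above, can be sketched in a few lines (a simplified version that ignores graph-constraint details; it assumes predicted triplets are already sorted by confidence, and the example triplets are made up):

```python
def recall_at_k(pred_triplets, gt_triplets, k):
    """Fraction of ground-truth (subject, predicate, object) triplets
    recovered among the top-k scored predictions."""
    top_k = set(pred_triplets[:k])   # assumed sorted by confidence
    gt = set(gt_triplets)
    return len(top_k & gt) / len(gt)

preds = [("person", "riding", "horse"),
         ("person", "wearing", "hat"),
         ("horse", "on", "grass")]
gt = [("person", "riding", "horse"), ("horse", "on", "grass")]
recall_at_k(preds, gt, 2)  # 0.5: one of two GT triplets in the top-2
```

Mean Recall@K averages this per predicate class, which is why it is the preferred number when datasets are long-tailed.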
This work provides a baseline for future research in knowledge-enhanced visual understanding and reasoning. The source code is available at https://github.com/jaleedkhan/neusire.

  • Research Article
  • 10.1360/ssi-2022-0105
Balanced scene graph generation assisted by an additional biased predictor
  • Nov 1, 2022
  • SCIENTIA SINICA Informationis
  • Wenbin Wang + 2 more

A scene graph is a structural representation of a scene comprising the objects as nodes and relationships between any two objects as edges. The scene graph is widely adopted in high-level vision language and reasoning applications. Therefore, scene graph generation has been a popular topic in recent years. However, it is limited by bias due to the long-tailed distribution among the relationships. Scene graph generators prefer to predict the head predicates, which are ambiguous and less precise. It makes the scene graph convey less information and degenerate into the stacking of objects, restricting other applications from reasoning on the graph. To make the generator predict more diverse relationships and provide a precise scene graph, we propose an additional biased predictor (ABP)-assisted balanced learning method. This method introduces an extra relationship prediction branch that is especially affected by the bias to make the generator pay more attention to the tail predicates rather than the head ones. Compared to the scene graph generator that predicts relationships between object pairs, the biased branch predicts the relationships without being assigned a certain object pair of interest, which is more concise. To train this biased branch, the region-level relationship annotation is constructed using the instance-level relationship annotation automatically. Extensive experiments on popular datasets, i.e., Visual Genome, VRD, and OpenImages, show that the ABP is effective on different scene graph generators. Besides, it makes the generator predict more diverse and accurate relationships and provides a more balanced and practical scene graph.

  • Conference Article
  • Cited by 1
  • 10.1109/ccis57298.2022.10016416
A Multimodal Fusion Scene Graph Generation Method Based on Semantic Description
  • Nov 26, 2022
  • Liwen Ma + 2 more

For the scene graph generation task, a multimodal fusion scene graph generation method based on semantic description is proposed, addressing the long-tail distribution and the low frequency of high-level semantic interactions in the dataset. First, object detection and relationship inference are performed on the image to construct an image scene graph. Second, the semantic descriptions are transformed into semantic graphs, which are fed into a pre-trained scene graph parser to construct semantic scene graphs. Finally, the two scene graphs are aligned, and the information of nodes and edges is updated to obtain a fused scene graph with more comprehensive coverage and more accurate semantic interaction information.

  • Conference Article
  • Cited by 542
  • 10.1109/cvpr42600.2020.00377
Unbiased Scene Graph Generation From Biased Training
  • Jun 1, 2020
  • Kaihua Tang + 4 more

Today’s scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse human walk on / sit on / lay on beach into human on beach. Given such SGG, the down-stream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between the good and bad bias, e.g., good context prior (e.g., person read book rather than eat) and bad long-tailed bias (e.g., near dominating behind / in front of). In this paper, we present a novel SGG framework based on causal inference but not the conventional likelihood. We first build a causal graph for SGG, and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect from the bad bias, which should be removed. In particular, we use Total Direct Effect (TDE) as the proposed final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied in the community who seeks unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
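
The counterfactual scoring described above can be sketched in a few lines (an illustrative toy, not the paper's full causal-graph implementation; the predicate labels and logit values are invented):

```python
import numpy as np

# Total Direct Effect (TDE) sketch: subtract the counterfactual predicate
# logits (what the context/bias alone predicts, with the pair's visual
# evidence "wiped out") from the factual logits, so only the direct
# visual effect scores the predicate.

def tde_predicate_scores(factual_logits, counterfactual_logits):
    return np.asarray(factual_logits) - np.asarray(counterfactual_logits)

predicates = ["on", "sitting on", "lying on"]
factual = [2.0, 0.5, 1.0]          # prediction from the full input
counterfactual = [1.8, 0.1, 0.2]   # prediction driven by bias alone
scores = tde_predicate_scores(factual, counterfactual)
# the head predicate "on" is demoted; a finer-grained predicate now wins
```

This is the sense in which the framework is model-agnostic: any SGG model that can produce both factual and counterfactual logits can be debiased by the same subtraction.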

  • Conference Article
  • Cited by 11
  • 10.1145/3474085.3475545
Mask and Predict
  • Oct 17, 2021
  • Hongshuo Tian + 6 more

Scene Graph Generation (SGG) aims to parse the image as a set of semantics, containing objects and their relations. Current SGG methods stop at presenting intuitive detections in the image, such as the triplet logo on board. Intuitively, we humans can further refine these intuitive detections into rational descriptions like flower painted on surfboard. However, most existing methods formulate SGG as a straightforward task limited to one-time prediction, focusing on a single-pass pipeline that predicts all the semantics at once. To handle this problem, we propose a novel multi-step reasoning manner for SGG. Concretely, we break SGG into two explicit learning stages: an intuitive training stage (ITS) and a rational training stage (RTS). In the first stage, we follow traditional SGG processing to detect objects and relationships, yielding an intuitive scene graph. In the second stage, we perform multi-step reasoning to refine the intuitive scene graph. Each reasoning step consists of two operations: mask and predict. According to the primary predictions and their confidences, we repeatedly select and mask the low-confidence predictions, whose features are optimized and predicted again. After several iterations, all the intuitive semantics gradually get revised with high confidence, yielding a rational scene graph. Extensive experiments on Visual Genome prove the superiority of the proposed method. Additional ablation studies and visualization cases further validate its effectiveness.
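
The mask-and-predict loop described above can be sketched schematically (a toy stand-in: `repredict` here is a hypothetical callback for re-running the model on one item, and the labels are invented):

```python
def mask_and_predict(confidences, labels, repredict, threshold=0.7, steps=3):
    """Iteratively re-predict low-confidence semantics.

    Each step masks every prediction whose confidence falls below the
    threshold and asks the model (via `repredict`) for a fresh label.
    Stops early once everything is confident.
    """
    for _ in range(steps):
        low = [i for i, c in enumerate(confidences) if c < threshold]
        if not low:
            break
        for i in low:
            labels[i], confidences[i] = repredict(i)
    return labels, confidences

labels = ["logo on board", "flower on surfboard"]
confs = [0.9, 0.4]
# stand-in re-predictor: always returns a refined label at high confidence
labels, confs = mask_and_predict(
    confs, labels, lambda i: ("flower painted on surfboard", 0.95))
```

The paper's contribution is in how the masked features are re-encoded before re-prediction; this sketch only shows the control flow.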

  • Research Article
  • Cited by 7
  • 10.3390/rs14133118
RSSGG_CS: Remote Sensing Image Scene Graph Generation by Fusing Contextual Information and Statistical Knowledge
  • Jun 29, 2022
  • Remote Sensing
  • Zhiyuan Lin + 6 more

To semantically understand remote sensing images, it is not only necessary to detect the objects in them but also to recognize the semantic relationships between the instances. Scene graph generation aims to represent the image as a semantic structural graph, where objects and relationships between them are described as nodes and edges, respectively. Some existing methods rely only on visual features to sequentially predict the relationships between objects, ignoring contextual information and making it difficult to generate high-quality scene graphs, especially for remote sensing images. Therefore, we propose a novel model for remote sensing image scene graph generation by fusing contextual information and statistical knowledge, namely RSSGG_CS. To integrate contextual information and calculate attention among all objects, the RSSGG_CS model adopts a filter module (FiM) that is based on adjusted transformer architecture. Moreover, to reduce the blindness of the model when searching semantic space, statistical knowledge of relational predicates between objects from the training dataset and the cleaned Wikipedia text is used as supervision when training the model. Experiments show that fusing contextual information and statistical knowledge allows the model to generate more complete scene graphs of remote sensing images and facilitates the semantic understanding of remote sensing images.

  • Book Chapter
  • Cited by 8
  • 10.1007/978-3-031-06981-9_6
Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning
  • Jan 1, 2022
  • Muhammad Jaleed Khan + 2 more

Scene graph generation aims to capture the semantic elements in images by modelling objects and their relationships in a structured manner, which are essential for visual understanding and reasoning tasks including image captioning, visual question answering, multimedia event processing, visual storytelling and image retrieval. The existing scene graph generation approaches provide limited performance and expressiveness for higher-level visual understanding and reasoning. This challenge can be mitigated by leveraging commonsense knowledge, such as related facts and background knowledge, about the semantic elements in scene graphs. In this paper, we propose the infusion of diverse commonsense knowledge about the semantic elements in scene graphs to generate rich and expressive scene graphs using a heterogeneous knowledge source that contains commonsense knowledge consolidated from seven different knowledge bases. The graph embeddings of the object nodes are used to leverage their structural patterns in the knowledge source to compute similarity metrics for graph refinement and enrichment. We performed experimental and comparative analysis on the benchmark Visual Genome dataset, in which the proposed method achieved a higher recall rate (R@K = 29.89, 35.4, 39.12 for K = 20, 50, 100) as compared to the existing state-of-the-art technique (R@K = 25.8, 33.3, 37.8 for K = 20, 50, 100). The qualitative results of the proposed method in a downstream task of image generation showed that more realistic images are generated using the commonsense knowledge-based scene graphs. These results depict the effectiveness of commonsense knowledge infusion in improving the performance and expressiveness of scene graph generation for visual understanding and reasoning tasks.

  • Conference Article
  • Cited by 294
  • 10.1109/cvpr.2019.00207
Scene Graph Generation With External Knowledge and Image Reconstruction
  • Jun 1, 2019
  • Jiuxiang Gu + 5 more

Scene graph generation has received growing attention with the advancements in image understanding tasks such as object detection, attributes and relationship prediction, etc. However, existing datasets are biased in terms of object and relationship labels, or often come with noisy and missing annotations, which makes the development of a reliable scene graph prediction model very challenging. In this paper, we propose a novel scene graph generation algorithm with external knowledge and image reconstruction loss to overcome these dataset issues. In particular, we extract commonsense knowledge from the external knowledge base to refine object and phrase features for improving generalizability in scene graph generation. To address the bias of noisy object annotations, we introduce an auxiliary image reconstruction path to regularize the scene graph generation network. Extensive experiments show that our framework can generate better scene graphs, achieving the state-of-the-art performance on two benchmark datasets: Visual Relationship Detection and Visual Genome datasets.

  • Conference Article
  • Cited by 2
  • 10.1109/skima57145.2022.10029570
Attention-Based Scene Graph Generation: A Review
  • Dec 2, 2022
  • Afsana Airin + 6 more

The automated creation of a semantic structural scene graph from an image or video is known as scene graph generation (SGG), which includes accurate labeling of all identified objects and the interconnections between them. Several SGG methods employing deep learning techniques have been proposed to achieve good results, but most approaches fail to integrate the contextual information of pairs of objects. Beyond the existing state of the art in SGG, the attention mechanism is opening a new horizon in this field. This paper offers a thorough analysis of the most recent attention-based scene graph generation techniques. We have compared and tested five existing attention-based SGG methods and summarised their results to understand progress in this field. Moreover, we discuss the strengths of existing techniques and future directions for attention-based models in scene graph generation.

  • Research Article
  • 10.3390/info15120766
Enabling Perspective-Aware Ai with Contextual Scene Graph Generation
  • Dec 2, 2024
  • Information
  • Daniel Platnick + 2 more

This paper advances contextual image understanding within perspective-aware Ai (PAi), an emerging paradigm in human–computer interaction that enables users to perceive and interact through each other’s perspectives. While PAi relies on multimodal data—such as text, audio, and images—challenges in data collection, alignment, and privacy have led us to focus on enabling the contextual understanding of images. To achieve this, we developed perspective-aware scene graph generation with LLM post-processing (PASGG-LM). This framework extends traditional scene graph generation (SGG) by incorporating large language models (LLMs) to enhance contextual understanding. PASGG-LM integrates classical scene graph outputs with LLM post-processing to infer richer contextual information, such as emotions, activities, and social contexts. To test PASGG-LM, we introduce the context-aware scene graph generation task, where the goal is to generate a context-aware situation graph describing the input image. We evaluated PASGG-LM pipelines using state-of-the-art SGG models, including Motifs, Motifs-TDE, and RelTR, and showed that fine-tuning LLMs, particularly GPT-4o-mini and Llama-3.1-8B, improves performance in terms of R@K, mR@K, and mAP. Our method is capable of generating scene graphs that capture complex contextual aspects, advancing human–machine interaction by enhancing the representation of diverse perspectives. Future directions include refining contextual scene graph models and expanding multi-modal data integration for PAi applications in domains such as healthcare, education, and social robotics.

  • Research Article
  • Cited by 1
  • 10.1109/tmi.2024.3444279
S²Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR.
  • Jan 1, 2025
  • IEEE transactions on medical imaging
  • Jialun Pei + 5 more

Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection. This pipeline may potentially compromise the flexibility of learning multimodal representations, consequently constraining the overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR, aimed to complementally leverage multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point cloud features. Moreover, based on the augmented feature, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, which enables the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments have validated the superior SGG performance and lower computational cost of S2Former-OR on 4D-OR benchmark, compared with current OR-SGG methods, e.g., 3 percentage points Precision increase and 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods with broader metrics for a comprehensive evaluation, with consistently better performance achieved. Our source code can be made available at: https://github.com/PJLallen/S2Former-OR.

  • Research Article
  • Cited by 11
  • 10.1109/tcyb.2021.3052522
Relation Regularized Scene Graph Generation
  • Mar 12, 2021
  • IEEE Transactions on Cybernetics
  • Yuyu Guo + 6 more

Scene graph generation (SGG) is built on top of detected objects to predict object pairwise visual relations for describing the image content abstraction. Existing works have revealed that if the links between objects are given as prior knowledge, the performance of SGG is significantly improved. Inspired by this observation, in this article, we propose a relation regularized network (R2-Net), which can predict whether there is a relationship between two objects and encode this relation into object feature refinement and better SGG. Specifically, we first construct an affinity matrix among detected objects to represent the probability of a relationship between two objects. Graph convolution networks (GCNs) over this relation affinity matrix are then used as object encoders, producing relation-regularized representations of objects. With these relation-regularized features, our R2-Net can effectively refine object labels and generate scene graphs. Extensive experiments are conducted on the visual genome dataset for three SGG tasks (i.e., predicate classification, scene graph classification, and scene graph detection), demonstrating the effectiveness of our proposed method. Ablation studies also verify the key roles of our proposed components in performance improvement.
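
The core propagation step described above — graph convolution over a relation affinity matrix — can be sketched as follows (a minimal illustration assuming a standard row-normalized GCN layer; the variable names and toy inputs are ours, not R2-Net's exact formulation):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step: H' = ReLU(D^-1 (A + I) H W).

    A : (n, n) affinity matrix, A[i, j] ~ probability objects i, j relate
    H : (n, d) per-object features
    W : (d, d') layer weights
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # row-normalize
    return np.maximum(D_inv @ A_hat @ H @ W, 0.0)  # ReLU

A = np.array([[0., 1.],    # object 0 is likely related to object 1
              [1., 0.]])
H = np.eye(2)              # toy one-hot object features
W = np.eye(2)              # toy identity weights
H_refined = gcn_layer(A, H, W)  # each node mixes in its neighbor's features
```

With identity features and weights, each refined row becomes the average of the node and its neighbor, which is exactly the "relation-regularized" mixing the abstract describes.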

  • Research Article
  • Cited by 25
  • 10.1016/j.neucom.2023.127052
Scene Graph Generation: A comprehensive survey
  • Nov 20, 2023
  • Neurocomputing
  • Hongsheng Li + 9 more

Deep learning techniques have led to remarkable breakthroughs in the field of object detection and have spawned many scene-understanding tasks in recent years. The scene graph has been a focus of research because of its powerful semantic representation and applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image or a video into a semantic structural scene graph, which requires the correct labeling of detected objects and their relationships. In this paper, a comprehensive survey of recent achievements is provided. This survey attempts to connect and systematize existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. Deep discussions about current problems and future research directions are given at last. This survey will help readers develop a better understanding of current research.

  • Conference Article
  • Cited by 16
  • 10.1145/3459637.3482218
Lightweight Visual Question Answering using Scene Graphs
  • Oct 26, 2021
  • Sai Vidyaranya Nuthalapati + 6 more

Visual question answering (VQA) is a challenging problem in machine perception, which requires a deep joint understanding of both visual and textual data. Recent research has advanced the automatic generation of high-quality scene graphs from images, while powerful yet elegant models like graph neural networks (GNNs) have shown great power in reasoning over graph-structured data. In this work, we propose to bridge the gap between scene graph generation and VQA by leveraging GNNs. In particular, we design a new model called Conditional Enhanced Graph ATtention network (CE-GAT) to encode pairs of visual and semantic scene graphs with both node and edge features, which is seamlessly integrated with a textual question encoder to generate answers through question-graph conditioning. Moreover, to alleviate the training difficulties of CE-GAT towards VQA, we enforce more useful inductive biases in the scene graphs through novel question-guided graph enriching and pruning. Finally, we evaluate the framework on one of the largest available VQA datasets (namely, GQA) with ground-truth scene graphs, achieving the accuracy of 77.87%, compared with the state of the art (namely, the neural state machine (NSM)), which gives 63.17%. Notably, by leveraging existing scene graphs, our framework is much lighter compared with end-to-end VQA methods (e.g., about 95.3% less parameters than a typical NSM).

  • Conference Article
  • Cited by 14
  • 10.1109/iccv48922.2021.01605
Unconditional Scene Graph Generation
  • Oct 1, 2021
  • Sarthak Garg + 5 more

Despite recent advancements in single-domain or single-object image generation, it is still challenging to generate complex scenes containing diverse, multiple objects and their interactions. Scene graphs, composed of nodes as objects and directed-edges as relationships among objects, offer an alternative representation of a scene that is more semantically grounded than images. We hypothesize that a generative model for scene graphs might be able to learn the underlying semantic structure of real-world scenes more effectively than images, and hence, generate realistic novel scenes in the form of scene graphs. In this work, we explore a new task for the unconditional generation of semantic scene graphs. We develop a deep auto-regressive model called SceneGraphGen which can directly learn the probability distribution over labelled and directed graphs using a hierarchical recurrent architecture. The model takes a seed object as input and generates a scene graph in a sequence of steps, each step generating an object node, followed by a sequence of relationship edges connecting to the previous nodes. We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes. Additionally, we demonstrate the application of the generated graphs in image synthesis, anomaly detection and scene graph completion.
