Taking Language Embedded 3D Gaussian Splatting into the Wild.
Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, existing methods primarily focus on visual appearance reconstruction, often overlooking the interactive semantic understanding of these 3D scenes (e.g., identifying specific building parts or scene details), which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for comprehensive 3D scene understanding beyond mere visual appearance? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps (transient uncertainty and appearance uncertainty), derived from the multi-appearance features, to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.
- Research Article
4
- 10.1609/aaai.v37i3.25453
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
Learning descriptive 3D features is crucial for understanding 3D scenes with diverse objects and complex structures. However, it is usually unknown whether important geometric attributes and scene context receive enough emphasis in an end-to-end trained 3D scene understanding network. To guide 3D feature learning toward important geometric attributes and scene context, we explore the help of textual scene descriptions. Given some free-form descriptions paired with 3D scenes, we extract the knowledge regarding the object relationships and object attributes. We then inject this knowledge into 3D feature learning through three classification-based auxiliary tasks. This language-assisted training can be combined with modern object detection and instance segmentation methods to promote 3D semantic scene understanding, especially in a label-deficient regime. Moreover, the 3D features learned with language assistance are better aligned with the language features, which can benefit various 3D-language multimodal tasks. Experiments on several benchmarks of 3D-only and 3D-language tasks demonstrate the effectiveness of our language-assisted 3D feature learning. Code is available at https://github.com/Asterisci/Language-Assisted-3D.
- Conference Article
- 10.1109/icvrv.2017.00121
- Oct 1, 2017
With the rapid growth of the virtual reality industry, fast and accurate algorithms for scene reconstruction and understanding have become a research focus in related fields. Traditional methods treat 3D modeling and scene understanding as two separate problems and solve them independently. In this paper, we propose a new method to reconstruct semantic 3D models from multi-view images. The resulting model not only contains information about points in 3D space, but also records their relationship with pixels in the images. We conduct experiments on four challenging real-world datasets to test the effectiveness of our proposed method. The reconstruction can be directly applied to virtual reality applications, such as roaming in 3D scenes.
- Dissertation
2
- 10.32657/10356/182101
- Jan 1, 2024
With the rapid development of industry and intelligent systems, semantic scene understanding has become essential for robotic vision in smart manufacturing. Robots have significantly advanced modern manufacturing by enabling high-quality, efficient production, extended operation durations, and work in hazardous environments. Robotic techniques have automated many processes in production lines. However, in flexible production scenarios, certain tasks cannot yet be fully handled by robots and still require human involvement. This limitation is usually caused by robots' lack of semantic understanding of the target objects in the working environment. Developing visual scene understanding techniques can enable robots to accurately recognize and localize objects or regions in visual scenes at the pixel level. These techniques greatly enhance the capability and flexibility of robots in the manufacturing industry and in various general robotic applications. Consequently, human effort in the production pipeline can be largely replaced by robots with visual understanding capabilities. This research mainly focuses on the task of 3D instance segmentation, which aims to predict both semantic and instance labels for each point in point clouds. This is a fundamental and challenging task for scene understanding, with a variety of real-world applications, such as indoor robots, autonomous driving, drones, AR/VR devices, etc. We propose five different novel methods, including one fully supervised method, two weakly supervised methods, one zero-shot method, and an augmentation method to enhance model generalization. In Chapter 3, we propose a novel proposal-free fully supervised method called Regional Purity Guide Network (RPGN). We define a novel concept of regional purity, which encodes instance-aware contextual information of the surrounding region.
We also propose a pretraining pipeline for learning regional purity and design rules to generate random toy scenes by extracting samples from existing training data. Using regional purity can simultaneously prevent under-segmentation and over-segmentation problems during clustering. Although scene understanding has achieved remarkable success with deep learning techniques, it remains largely unsolved. One critical bottleneck is the significant human effort required for pixel-level labeling. To address this issue, in Chapter 4, we propose a novel weakly supervised method, RWSeg, that requires labeling only one object with a single point. Using these sparse weak labels, we introduce a unified framework with two branches to propagate semantic and instance information to unannotated regions, leveraging self-attention and random walk. Furthermore, we propose a Cross-graph Competing Random Walks (CGCRW) algorithm which encourages competition among different instance graphs to resolve ambiguities in closely positioned objects and improve the performance on instance assignment. In Chapter 5, we propose the first weakly-supervised 3D instance segmentation method that only needs categorical semantic labels as supervision, and we do not need instance-level labels. Even without having any instance-related ground-truth, we design an approach to break point clouds into raw fragments and find the most confident samples for learning instance centroids. In addition, we build a recomposed dataset to learn our defined multilevel shape-aware objectness signal. An asymmetrical object inference algorithm is followed to process core points and boundary points with different strategies, and generate high-quality pseudo instance labels to guide iterative training. In the current era dominated by large foundation models, these expansive vision models adeptly capture knowledge from vast, broad datasets, enabling them to execute zero-shot segmentation on previously unseen data. 
In Chapter 6, we delve into leveraging various 2D foundation models to address the challenges of 3D segmentation tasks. Our approach begins by generating initial predictions of 2D semantic masks using diverse large foundation models. These mask predictions, obtained from different frames of RGB-D video sequences, are then projected into 3D space. To produce robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all results through voting. Our investigation encompasses various scenarios, including zero-shot learning and limited guidance from sparse 2D point labels, allowing us to evaluate the strengths and limitations of different vision foundation models. Data augmentation is essential in deep learning for improving model generalization and robustness. While standard methods like rotations and flips have been common, they often lack high-level diversity. In Chapter 7, we explore a novel approach to automatically generate 3D labeled training data. By utilizing diffusion models and ChatGPT-generated text prompts, we generate diverse 2D images of single objects with various structures and appearances. Beyond texture augmentation, our method automatically alters object shapes within these images. These augmented images are then transformed into 3D objects, and virtual scenes are constructed through random composition. This approach efficiently produces a substantial amount of 3D scene data without relying on real data, offering significant advantages in addressing few-shot learning challenges and mitigating long-tailed class imbalances. Our work contributes to enhancing 3D data diversity and advancing model capabilities in scene understanding tasks.
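The label fusion step described for Chapter 6 (projecting per-frame 2D mask predictions into 3D and combining them through voting) can be illustrated with a minimal sketch. The abstract does not give the exact voting rule, so plain majority voting is an assumption here, and the function name `fuse_labels_by_voting`, the `(point_id, label)` input format, and the `ignore_label` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def fuse_labels_by_voting(observations, ignore_label=None):
    """Fuse per-frame label predictions into one label per 3D point.

    observations: iterable of (point_id, label) pairs, one pair for each
    frame in which a projected 2D mask assigned `label` to that point
    (hypothetical input format, not taken from the dissertation).
    Returns {point_id: (winning_label, vote_share)}.
    """
    votes = defaultdict(Counter)
    for point_id, label in observations:
        if label == ignore_label:
            continue  # skip frames that gave no usable prediction
        votes[point_id][label] += 1

    fused = {}
    for point_id, counter in votes.items():
        label, count = counter.most_common(1)[0]  # plain majority vote
        fused[point_id] = (label, count / sum(counter.values()))
    return fused


if __name__ == "__main__":
    obs = [(0, "chair"), (0, "chair"), (0, "table"), (1, "floor")]
    print(fuse_labels_by_voting(obs))
```

The vote share returned alongside each label could serve as a rough confidence score when filtering pseudo labels, although whether the original work does this is not stated.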
- Conference Article
10
- 10.1109/cvpr46437.2021.00740
- Jun 1, 2021
Recent advances in 3D semantic scene understanding have shown impressive progress in 3D instance segmentation, enabling object-level reasoning about 3D scenes; however, a finer-grained understanding is required to enable interactions with objects and their functional understanding. Thus, we propose the task of part-based scene understanding of real-world 3D environments: from an RGB-D scan of a scene, we detect objects, and for each object predict its decomposition into geometric part masks, which composed together form the complete geometry of the observed object. We leverage an intermediary part graph representation to enable robust completion as well as building of part priors, which we use to construct the final part mask predictions. Our experiments demonstrate that guiding part understanding through part graph to part prior-based predictions significantly outperforms alternative approaches to the task of semantic part completion.
- Research Article
3
- 10.1109/tip.2024.3421952
- Jan 1, 2024
- IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
2D-3D joint learning is essential and effective for fundamental 3D vision tasks, such as 3D semantic segmentation, due to the complementary information these two visual modalities contain. Most current 3D scene semantic segmentation methods process 2D images "as they are", i.e., only real captured 2D images are used. However, such captured 2D images may be redundant, with abundant occlusion and/or limited field of view (FoV), leading to poor performance for the current methods involving 2D inputs. In this paper, we propose a general learning framework for joint 2D-3D scene understanding by selecting informative virtual 2D views of the underlying 3D scene. We then feed both the 3D geometry and the generated virtual 2D views into any joint 2D-3D-input or pure 3D-input based deep neural models for improving 3D scene understanding. Specifically, we generate virtual 2D views based on an information score map learned from the current 3D scene semantic segmentation results. To achieve this, we formalize the learning of the information score map as a deep reinforcement learning process, which rewards good predictions using a deep neural network. To obtain a compact set of virtual 2D views that jointly cover informative surfaces of the 3D scene as much as possible, we further propose an efficient greedy virtual view coverage strategy in the normal-sensitive 6D space, including 3-dimensional point coordinates and 3-dimensional normal. We have validated our proposed framework for various joint 2D-3D-input or pure 3D-input based deep neural models on two real-world 3D scene datasets, i.e., ScanNet v2 and S3DIS, and the results demonstrate that our method obtains a consistent gain over baseline models and achieves new top accuracy for joint 2D and 3D scene semantic segmentation. Code is available at https://github.com/smy-THU/VirtualViewSelection.
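The greedy virtual view coverage idea can be sketched as a standard greedy set-cover loop. This is only an illustration under simplifying assumptions: each candidate view is reduced to the set of surface cells it covers, the cells are opaque identifiers rather than points in the normal-sensitive 6D space, and the names `greedy_view_selection`, `view_coverage`, and `max_views` are hypothetical. The learned information score map is not modeled.

```python
def greedy_view_selection(view_coverage, max_views):
    """Pick views one at a time, each time taking the view that covers
    the most cells not yet covered.

    view_coverage: dict mapping a view id to the set of cell ids it covers
    (hypothetical representation; cells stand in for quantized 6D samples).
    Returns (selected_view_ids, covered_cells).
    """
    covered = set()
    selected = []
    remaining = dict(view_coverage)

    while remaining and len(selected) < max_views:
        # View with the largest marginal gain over the current coverage.
        best_view = max(remaining, key=lambda v: len(remaining[v] - covered))
        gain = remaining[best_view] - covered
        if not gain:
            break  # no remaining view adds new cells
        selected.append(best_view)
        covered |= gain
        del remaining[best_view]

    return selected, covered


if __name__ == "__main__":
    views = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}
    print(greedy_view_selection(views, max_views=3))
```

Greedy selection keeps the view set compact because a view that only re-observes already covered cells is never chosen.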
- Research Article
- 10.3390/s24196166
- Sep 24, 2024
- Sensors (Basel, Switzerland)
Three-dimensional (3D) scene understanding achieves environmental perception by extracting and analyzing point cloud data, with wide applications including virtual reality, robotics, etc. Previous methods align the 2D image features from a pre-trained CLIP model with the 3D point cloud features to obtain open-vocabulary scene understanding ability. We believe that existing methods have the following two deficiencies: (1) the 3D feature extraction process ignores the challenges of real scenarios, i.e., point cloud data are very sparse and even incomplete; (2) the training stage lacks direct text supervision, leading to inconsistency with the inference stage. To address the first issue, we employ a Masked Consistency training policy. Specifically, during the alignment of 3D and 2D features, we mask some 3D features to force the model to understand the entire scene using only partial 3D features. For the second issue, we generate pseudo-text labels and align them with the 3D features during the training process. In particular, we first generate a description for each 2D image belonging to the same 3D scene and then use a summarization model to fuse these descriptions into a single description of the scene. Subsequently, we align 2D-3D features and 3D-text features simultaneously during training. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches.
- Research Article
16
- 10.1111/cgf.12433
- Aug 1, 2014
- Computer Graphics Forum
Designing a bas‐relief from a 3D scene is an inherently interactive task in many scenarios. The user normally needs instant feedback to select a proper viewpoint. However, current methods are too slow to facilitate this interaction. This paper proposes a two‐scale bas‐relief modeling method, which is computationally efficient and makes it easy to produce different styles of bas‐reliefs. The input 3D scene is first rendered into two textures, one recording the depth information and the other recording the normal information. The depth map is then compressed to produce a base surface with level‐of‐depth, and the normal map is used to extract local details with two different schemes. One scheme provides certain freedom to design bas‐reliefs with different visual appearances, and the other provides control over the level of detail. Finally, the local feature details are added to the base surface to produce the final result. Our approach allows for real‐time computation due to its implementation on graphics hardware. Experiments with a wide range of 3D models and scenes show that our approach can effectively generate digital bas‐reliefs in real time.
- Research Article
18
- 10.5194/isprs-annals-v-1-2020-165-2020
- Aug 3, 2020
- ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
3D indoor mapping and scene understanding have seen tremendous progress in recent years due to the rapid development of sensor systems, reconstruction techniques and semantic segmentation approaches. However, the quality of the acquired data strongly influences the accuracy of both reconstruction and segmentation. In this paper, we direct our attention to the evaluation of the mapping capabilities of the Microsoft HoloLens in comparison to high-quality TLS systems with respect to 3D indoor mapping, feature extraction and semantic segmentation. We demonstrate how a set of rather interpretable low-level geometric features and the resulting semantic segmentation achieved with a Random Forest classifier applied on these features are affected by the quality of the acquired data. The achieved results indicate that, while allowing for a fast acquisition of room geometries, the HoloLens provides data with sufficient accuracy for a wide range of applications.
- Conference Article
208
- 10.1109/cvpr42600.2020.00402
- Jun 1, 2020
Scene understanding has been of high interest in computer vision. It encompasses not only identifying objects in a scene, but also their relationships within the given context. With this goal, a recent line of works tackles 3D semantic segmentation and scene layout prediction. In our work we focus on scene graphs, a data structure that organizes the entities of a scene in a graph, where objects are nodes and their relationships modeled as edges. We leverage inference on scene graphs as a way to carry out 3D scene understanding, mapping objects and their relationships. In particular, we propose a learned method that regresses a scene graph from the point cloud of a scene. Our novel architecture is based on PointNet and Graph Convolutional Networks (GCN). In addition, we introduce 3DSSG, a semiautomatically generated dataset, that contains semantically rich scene graphs of 3D scenes. We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
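The scene graph structure described above (objects as nodes, relationships as edges) can be illustrated with a small sketch. This is not the 3DSSG data format or the paper's learned model; the class `SceneGraph`, its method names, and the example labels are hypothetical and only show the data structure itself.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Objects are nodes; relationships are directed, labeled edges."""
    nodes: dict = field(default_factory=dict)  # object id -> class label
    edges: list = field(default_factory=list)  # (subject id, predicate, object id)

    def add_object(self, object_id, label):
        self.nodes[object_id] = label

    def add_relationship(self, subject_id, predicate, object_id):
        # Edges may only connect objects that already exist as nodes.
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must be added as objects first")
        self.edges.append((subject_id, predicate, object_id))

    def relationships_of(self, subject_id):
        """Return (predicate, target label) pairs for one subject node."""
        return [(p, self.nodes[o]) for s, p, o in self.edges if s == subject_id]


if __name__ == "__main__":
    graph = SceneGraph()
    graph.add_object(1, "chair")
    graph.add_object(2, "floor")
    graph.add_relationship(1, "standing on", 2)
    print(graph.relationships_of(1))
```

Storing edges as labeled triples keeps the graph usable as an intermediate representation, since two graphs can be compared by their node labels and predicates without reference to the underlying point cloud.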
- Conference Article
- 10.1109/avss.2013.6636604
- Aug 1, 2013
Summary form only given. Inspired by the ability of humans to interpret and understand 3D scenes nearly effortlessly, the problem of 3D scene understanding has long been advocated as the "holy grail" of computer vision. In the early days this problem was addressed in a bottom-up fashion without enabling satisfactory or reliable results for scenes of realistic complexity. In recent years there has been considerable progress on many sub-problems of the overall 3D scene understanding problem. As these sub-tasks start to reach remarkable performance levels, we argue that the problem of automatically inferring and understanding 3D scenes should be addressed again. In this talk we will, on the one hand, highlight progress on some essential components of scene understanding such as object class recognition and articulated pose estimation and tracking. On the other hand, we will also report on our current attempt towards 3D scene understanding in the particular case of traffic scene analysis.
- Research Article
40
- 10.1016/j.optlaseng.2021.106767
- Aug 12, 2021
- Optics and Lasers in Engineering
Snapshot hyperspectral imaging polarimetry with full spectropolarimetric resolution
- Conference Article
51
- 10.1109/cvpr52688.2022.01835
- Jun 1, 2022
Although considerable progress has been made in semantic scene understanding under clear weather, it is still a tough problem under adverse weather conditions, such as dense fog, due to the uncertainty caused by imperfect observations. Besides, difficulties in collecting and labeling foggy images hinder the progress of this field. Considering the success in semantic scene understanding under clear weather, we think it is reasonable to transfer knowledge learned from clear images to the foggy domain. As such, the problem becomes one of bridging the domain gap between clear images and foggy images. Unlike previous methods that mainly focus on closing the domain gap caused by fog (defogging the foggy images or fogging the clear images), we propose to alleviate the domain gap by considering fog influence and style variation simultaneously. The motivation is based on our finding that the style-related gap and the fog-related gap can be divided and closed respectively, by adding an intermediate domain. Thus, we propose a new pipeline to cumulatively adapt style, fog and the dual-factor (style and fog). Specifically, we devise a unified framework to disentangle the style factor and the fog factor separately, and then the dual-factor from images in different domains. Furthermore, we combine the disentanglement of the three factors with a novel cumulative loss to thoroughly disentangle them. Our method achieves state-of-the-art performance on three benchmarks and shows generalization ability in rainy and snowy scenes.
- Book Chapter
8
- 10.1007/978-3-031-20497-5_49
- Jan 1, 2022
3D scene understanding and generation aim to reconstruct the layout of the scene and each object from an RGB image, estimate its semantic type in 3D space and generate a 3D scene. At present, 3D scene generation algorithms based on deep learning mainly recover the 3D scene from a single image. Due to the complexity of the real environment, the information provided by a single image is limited, and there are problems such as the lack of single-view information and the occlusion of objects in the scene. In response to the above problems, we propose a 3D scene generation framework, SGMT, which realizes multi-view position information fusion and reconstructs the 3D scene from multi-view video time series data to compensate for the missing object position in existing methods. We demonstrated the effectiveness of multi-view scene generation of SGMT on the UrbanScene3D and SUNRGBD datasets and studied the influence of SGCN and joint fine-tuning. In addition, we further explored the transfer ability of SGMT between datasets and discussed future improvements.
Keywords: 3D scene generation; Multi-view fusion; Multi-view time series data
- Conference Article
- 10.1109/siscon66686.2025.11409011
- Dec 19, 2025
In this project, we explore the integration of NeRFs and Transformers, creating a hybrid pipeline for 3D scene understanding. NeRFs are a novel approach to reconstructing 3D scenes from sparse 2D image inputs. However, they have limitations in spatial understanding and complex scene understanding. Transformers offer a global attention mechanism and feature extraction abilities, and hence leveraging them would improve the spatial representation and coherence of reconstructed scenes. Performance is evaluated on both synthetic and real-world datasets, and benchmarked using standard metrics such as PSNR and SSIM. This project has the potential to significantly impact applications in virtual reality, autonomous systems, and augmented reality by advancing the scalability and robustness of 3D scene reconstruction techniques.
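For readers unfamiliar with the PSNR metric mentioned above, the following is a minimal sketch using the standard definition PSNR = 10 · log10(MAX² / MSE). It operates on flat lists of pixel intensities; the function name `psnr` and the default peak value of 255 (8-bit images) are assumptions for illustration and are not taken from the project.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    sequences of pixel intensities (higher means closer to the reference)."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([10, 20, 30, 40], [12, 18, 33, 41]))
```

Because PSNR depends only on per-pixel error, it is commonly reported together with a structural metric such as SSIM, as the abstract indicates.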
- Supplementary Content
- 10.25394/pgs.12184701.v1
- Apr 24, 2020
- Figshare
Computer visualization can effectively deliver instructions to a user whose task requires understanding of a real world scene. Consider the example of surgical telementoring, where a general surgeon performs an emergency surgery under the guidance of a remote mentor. The mentor guidance includes annotations of the operating field, which conventionally are displayed to the surgeon on a nearby monitor. However, this conventional visualization of mentor guidance requires the surgeon to look back and forth between the monitor and the operating field, which can lead to cognitive load, delays, or even medical errors. Another example is 3D acquisition of a real-world scene, where an operator must acquire multiple images of the scene from specific viewpoints to ensure appropriate scene coverage and thus achieve quality 3D reconstruction. The conventional approach is for the operator to plan the acquisition locations using conventional visualization tools, and then to try to execute the plan from memory, or with the help of a static map. Such approaches lead to incomplete coverage during acquisition, resulting in an inaccurate reconstruction of the 3D scene which can only be addressed at the high and sometimes prohibitive cost of repeating acquisition. Augmented reality (AR) promises to overcome the limitations of conventional out-of-context visualization of real world scenes by delivering visual guidance directly into the user's field of view, guidance that remains in-context throughout the completion of the task. In this thesis, we propose and validate several AR visual interfaces that provide effective visual guidance for task completion in the context of surgical telementoring and 3D scene acquisition. A first AR interface provides a mentee surgeon with visual guidance from a remote mentor using a simulated transparent display.
A computer tablet suspended above the patient captures the operating field with its on-board video camera, the live video is sent to the mentor who annotates it, and the annotations are sent back to the mentee where they are displayed on the tablet, integrating the mentor-created annotations directly into the mentee's view of the operating field. We show through user studies that surgical task performance improves when using the AR surgical telementoring interface compared to when using the conventional visualization of the annotated operating field on a nearby monitor. A second AR surgical telementoring interface provides the mentee surgeon with visual guidance through an AR head-mounted display (AR HMD). We validate this approach in user studies with medical professionals in the context of practice cricothyrotomy and lower-limb fasciotomy procedures, and show improved performance over conventional surgical guidance. A comparison between our simulated transparent display and our AR HMD surgical telementoring interfaces reveals that the HMD has the advantages of reduced workspace encumbrance and of correct depth perception of annotations, whereas the transparent display has the advantage of reduced surgeon head and neck encumbrance and of annotation visualization quality. A third AR interface provides operator guidance for effective image-based modeling and rendering of real-world scenes. During the modeling phase, the AR interface builds and dynamically updates a map of the scene that is displayed to the user through an AR HMD, which leads to the efficient acquisition of a five-degree-of-freedom image-based model of large, complex indoor environments. During rendering, the interface guides the user towards the highest-density parts of the image-based model which result in the highest output image quality. 
We show through a study that first-time users of our interface can acquire a quality image-based model of a 13 m × 10 m indoor environment in 7 minutes. A fourth AR interface provides operator guidance for effective capture of a 3D scene in the context of photogrammetric reconstruction. The interface relies on an AR HMD with a tracked hand-held camera rig to construct a sufficient set of six-degrees-of-freedom camera acquisition poses and then to steer the user to align the camera with the prescribed poses quickly and accurately. We show through a study that first-time users of our interface are significantly more likely to achieve complete 3D reconstructions compared to conventional freehand acquisition. We then investigated the design space of AR HMD interfaces for mid-air pose alignment with an added ergonomics concern, which resulted in five candidate interfaces that sample this design space. A user study identified the aspects of the AR interface design that influence the ergonomics during extended use, informing AR HMD interface design for the important task of mid-air pose alignment.