From 2D to 3D: A Generative Model from Single Image to Digital 3D of Chinese Three Gorges Cultural Relics
The acquisition of high-quality three-dimensional (3D) models of cultural relics often relies on expensive scanning equipment or multi-view image capture, which limits large-scale deployment in real-world heritage conservation scenarios. Large-scale water impoundment in the Three Gorges region has resulted in the permanent submergence of numerous cultural relics and archaeological remains. For many of these artifacts, only a single two-dimensional image remains as the sole visual record, posing significant challenges for reconstructing their original three-dimensional geometry and appearance. This limitation renders traditional multi-view reconstruction and physical scanning methods infeasible. To address this challenge, we propose a generative framework for reconstructing high-fidelity 3D digital models of Chinese Three Gorges cultural relics from a single two-dimensional (2D) image. Building upon recent advances in generative 3D representation learning, the proposed method adopts a transformer-based image-to-triplane architecture to infer an implicit 3D representation directly from a single RGB image. A vision transformer encoder is employed to extract global and local visual features, which are subsequently projected into a compact triplane representation through a cross-attention-based decoder. The reconstructed triplane features are further decoded by a neural radiance field (NeRF) to synthesize dense geometry and appearance, enabling accurate mesh extraction and novel-view rendering. To enhance robustness under in-the-wild conditions, the model implicitly estimates camera parameters during inference without relying on explicit calibration information. The proposed method is evaluated on a dataset of Chinese Three Gorges cultural relics, covering diverse artifact categories and visual styles. 
Experimental results demonstrate that the proposed framework produces structurally coherent and visually consistent 3D reconstructions from a single image, preserving key morphological characteristics of cultural relics under limited data conditions. Compared with existing single-image and multi-view reconstruction baselines, it achieves better reconstruction accuracy, visual consistency, and generalization capability. This work provides an efficient and scalable solution for the digital reconstruction of submerged heritage artifacts, offers a practical pathway for large-scale 3D digitization from archival images, and contributes to the application of generative 3D modeling techniques in cultural heritage preservation and restoration.
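To make the triplane representation mentioned in the abstract more concrete, the following is a minimal sketch of how a feature vector for one 3D point could be read from three axis-aligned feature planes before being passed to a NeRF-style decoder. It is not the authors' implementation: the names `bilinear_sample` and `query_triplane`, the coordinate range [-1, 1], and the choice to sum (rather than concatenate) the three plane features are all assumptions made for illustration.

```python
from typing import List, Sequence

# A feature plane stored as plane[row][col][channel] (hypothetical layout).
Plane = List[List[List[float]]]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature plane at normalized (u, v) in [-1, 1]."""
    height, width = len(plane), len(plane[0])
    # Clamp to the valid range, then map [-1, 1] to pixel indices [0, size - 1].
    x = (min(max(u, -1.0), 1.0) + 1.0) * 0.5 * (width - 1)
    y = (min(max(v, -1.0), 1.0) + 1.0) * 0.5 * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x1][c]
        + (1 - tx) * ty * plane[y1][x0][c]
        + tx * ty * plane[y1][x1][c]
        for c in range(channels)
    ]


def query_triplane(
    plane_xy: Plane, plane_xz: Plane, plane_yz: Plane, point: Sequence[float]
) -> List[float]:
    """Project a 3D point onto the XY, XZ and YZ planes and combine the features.

    Summation is an assumption; concatenation is another common choice.
    """
    x, y, z = point
    f_xy = bilinear_sample(plane_xy, x, y)
    f_xz = bilinear_sample(plane_xz, x, z)
    f_yz = bilinear_sample(plane_yz, y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


# Toy 2x2 planes with a single feature channel.
toy_plane = [[[0.0], [1.0]], [[2.0], [3.0]]]
center_feature = query_triplane(toy_plane, toy_plane, toy_plane, (0.0, 0.0, 0.0))
```

In a full pipeline, the returned feature vector would be fed to a small network that predicts density and color for volume rendering; that decoder is omitted here.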
- Research Article
4
- 10.1016/j.cad.2023.103514
- Mar 21, 2023
- Computer-Aided Design
Wrinkles Realistic Clothing Reconstruction by Combining Implicit and Explicit Method
- Research Article
5
- 10.1016/j.cag.2022.11.010
- Nov 25, 2022
- Computers & Graphics
Robust and automatic clothing reconstruction based on a single RGB image
- Conference Article
48
- 10.1109/iccvw.2019.00439
- Oct 1, 2019
In contrast to the current literature, we address the problem of estimating the spectrum from a single common trichromatic RGB image obtained under unconstrained settings (e.g., unknown camera parameters, unknown scene radiance, unknown scene contents). For this, we use a reference spectrum as provided by a hyperspectral image camera, and propose efficient deep learning solutions for sensitivity function estimation and spectral reconstruction from a single RGB image. We further extend the concept of spectral reconstruction so that it works for RGB images taken in the wild, and propose a solution based on a convolutional network conditioned on the estimated sensitivity function. Besides the proposed solutions, we also study generic and sensitivity-specialized models and discuss their limitations. We achieve results competitive with the state of the art on the standard example-based spectral reconstruction benchmarks: ICVL, CAVE and NUS. Moreover, our experiments show that, for the first time, accurate spectral estimation from a single RGB image in the wild is within our reach.
- Book Chapter
24
- 10.1007/978-3-030-20870-7_30
- Jan 1, 2019
In this paper, we present Skeleton Transformer Networks (SkeletonNet), an end-to-end framework that can predict not only 3D joint positions but also the 3D angular pose (bone rotations) of a human skeleton from a single color image. This in turn allows us to generate skinned mesh animations. Here, we propose a two-step regression approach. The first step regresses bone rotations to obtain an initial solution that takes the skeleton structure into account. The second step performs refinement based on a heatmap regressor using a 3D pose representation called the cross heatmap, which stacks heatmaps of xy and zy coordinates. By training the network on the proposed 3D human pose dataset, which comprises images annotated with 3D skeletal angular poses, we show that SkeletonNet can predict a full 3D human pose (joint positions and bone rotations) from a single in-the-wild image.
- Conference Article
43
- 10.1109/cvpr42600.2020.00605
- Jun 1, 2020
Recovering the 3D shape of a person from its 2D appearance is ill-posed due to ambiguities. Nevertheless, with the help of convolutional neural networks (CNNs) and prior knowledge of the 3D human body, it is possible to overcome such ambiguities and recover detailed 3D shapes of human bodies from single images. Current solutions, however, fail to reconstruct all the details of a person wearing loose clothes. This is because of either (a) a huge memory requirement that cannot be met even on modern GPUs or (b) a compact 3D representation that cannot encode all the details. In this paper, we propose the tetrahedral outer shell volumetric truncated signed distance function (TetraTSDF) model for the human body, and its corresponding part connection network (PCN) for 3D human body shape regression. Our proposed model is compact, dense, accurate, and yet well suited for CNN-based regression tasks. Our proposed PCN allows us to learn the distribution of the TSDF in the tetrahedral volume from a single image in an end-to-end manner. Results show that our proposed method can reconstruct detailed shapes of humans wearing loose clothes from single RGB images.
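For readers unfamiliar with truncated signed distance functions, the sketch below shows the generic idea only: a signed distance is scaled by a truncation band and clamped to [-1, 1]. It does not reproduce the tetrahedral outer shell or the PCN from the paper. The sphere shape, the function names, the band width `tau`, and the sign convention (negative inside the surface) are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def sphere_signed_distance(
    point: Sequence[float], center: Sequence[float], radius: float
) -> float:
    """Signed distance to a sphere: negative inside, positive outside (assumed convention)."""
    return math.dist(point, center) - radius


def truncate(signed_distance: float, tau: float) -> float:
    """Scale by the truncation band tau and clamp to [-1, 1]."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return max(-1.0, min(1.0, signed_distance / tau))


# Values near the surface keep their (scaled) distance; far values saturate.
near_surface = truncate(sphere_signed_distance((1.25, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0), 0.5)
deep_inside = truncate(sphere_signed_distance((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0), 0.5)
```

Truncation keeps the representation focused on a thin band around the surface, which is one reason TSDF-style volumes can be stored compactly.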
- Conference Article
1
- 10.1117/12.2623417
- Feb 16, 2022
Although 3D human body reconstruction from a single image has progressed rapidly in recent years, most methods target the body without the hands and face. However, hand gestures and facial expressions are also important for conveying human intentions and emotions. This paper proposes a method for holistic 3D reconstruction of the human body from a single RGB image, including hands, body, and face. Our approach is based on SMPL eXpressive (SMPL-X), a unified 3D parametric human body model of body, hands, and face. Since it is difficult to accurately regress the model parameters of different body parts with a single framework, we use a divide-and-conquer strategy for whole-body reconstruction. We use different deep neural networks to predict the hand, body, and head model parameters, then integrate them into a single 3D model to realize a holistic and expressive 3D human body reconstruction. Simulation results demonstrate that our method achieves state-of-the-art performance with better facial expression.
- Research Article
5
- 10.1177/00405175221118105
- Aug 15, 2022
- Textile Research Journal
Hyperspectral images can significantly increase the accuracy of textile color measurement because of their rich information. However, hyperspectral imaging generally requires expensive equipment and complex operations. If the hyperspectral information can be reconstructed from a single RGB image, it can facilitate the widespread application of hyperspectral imaging technology, such as in textile color measurement. In this paper, a deep learning model based on the conditional generative adversarial network is proposed for hyperspectral reconstruction of cotton and linen fabrics. In this model, an encoder–decoder structure and a spatial pyramid convolution pooling operation are adopted to fuse multi-scale features and prevent mode collapse. Atrous convolution is introduced to increase the receptive field and adapt to fabric texture information, and the hyperspectral information of the fabric is reconstructed from a single RGB image. Quantitative and qualitative tests verified that the method performs well. The root mean square error and peak signal-to-noise ratio were 0.0271 and 31.372, respectively, for reconstructed fabric hyperspectral images; the highest average color difference in the reconstructed hyperspectral colorimetry experiment was 2.755. Thus, the proposed method can meet the common application requirements of color measurement.
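The two quantitative metrics reported above can be computed as in the following sketch. The abstract does not state how the metrics were aggregated or what peak value was used, so the flat-list input, the function names, and the default `peak=1.0` (reflectance normalized to [0, 1]) are assumptions.

```python
import math
from typing import Sequence


def rmse(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Root mean square error between two equally sized flat sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return math.sqrt(squared / len(reference))


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; peak=1.0 assumes data normalized to [0, 1]."""
    error = rmse(reference, estimate)
    if error == 0.0:
        return math.inf  # identical signals
    return 20.0 * math.log10(peak / error)


# Toy example: every sample is off by 0.1.
reference_band = [0.0, 0.0, 0.0, 0.0]
estimated_band = [0.1, 0.1, 0.1, 0.1]
toy_rmse = rmse(reference_band, estimated_band)
toy_psnr = psnr(reference_band, estimated_band)
```

For a hyperspectral cube, the same functions would be applied to the flattened band values, either per image or over the whole test set.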
- Conference Article
106
- 10.1109/cvpr52688.2022.01292
- Jun 1, 2022
Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce the body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for “Real scenes, Interaction, Contact and Humans.” RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. Our code and dataset are available for research purposes at: https://rich.is.tue.mpg.de.
- Research Article
- 10.1111/jmi.13418
- May 6, 2025
- Journal of microscopy
Despite the development of 3D imaging technology, reconstructing a three-dimensional (3D) microstructure from a single two-dimensional (2D) image remains a prominent problem. In this paper, we propose a hierarchical reconstruction method based on simulated annealing, named the hierarchical simulated annealing (HSA) method, which uses multiscale entropy statistics as the morphological information descriptor to reconstruct the 3D microstructure corresponding to a single 2D image. Both the HSA method and the simulated annealing (SA) method are used to perform 2D and 3D microstructure reconstruction from a single 2D image, with the two-point cluster function and the standard two-point correlation function used as the measurement metrics for the reconstructed 2D and 3D structures. The 2D reconstructions show that the results of both methods not only capture morphological information similar to that of the original images but also agree well with the target microstructures in the two-point cluster function. For the reconstructed 3D microstructures, the comparison of the two-point correlation function shows that both methods can effectively reconstruct the 3D microstructure, and the comparison of reconstruction times shows that the HSA method is an order of magnitude faster than the SA method.
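As an illustration of the two-point correlation function used as a measurement metric above, the sketch below estimates S2(r) for a binary image, i.e. the probability that two pixels separated by r are both in the phase of interest. It is a simplified version: only horizontal lags are considered, boundaries are non-periodic, and the function name and input layout (nested lists of 0/1) are assumptions rather than the paper's implementation.

```python
from typing import List, Sequence


def two_point_correlation(image: Sequence[Sequence[int]], max_lag: int) -> List[float]:
    """Estimate S2(r) along rows for r = 0..max_lag on a binary (0/1) image."""
    rows, cols = len(image), len(image[0])
    if not 0 <= max_lag < cols:
        raise ValueError("max_lag must be in [0, number of columns - 1]")
    values = []
    for lag in range(max_lag + 1):
        pair_count = rows * (cols - lag)  # pixel pairs available at this lag
        hits = sum(
            image[i][j] * image[i][j + lag]  # 1 only if both pixels are phase 1
            for i in range(rows)
            for j in range(cols - lag)
        )
        values.append(hits / pair_count)
    return values


# Checkerboard-like toy image: phase-1 pixels never touch horizontally.
toy_image = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
toy_s2 = two_point_correlation(toy_image, 2)
```

At lag 0 the function returns the volume fraction of the phase; comparing such curves for a reconstruction and its target is one way to quantify agreement between them.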
- Conference Article
5
- 10.1109/icspcc.2016.7753653
- Aug 1, 2016
The three-dimensional (3D) microscopic pore structure of reservoir rock directly affects its seepage characteristics and physical properties. A 3D microscopic pore structure can be reconstructed from a single two-dimensional (2D) training image (TI) by using mathematical modeling methods. In this paper, we bring the concepts of blocks, dictionary, and learning from example-based super-resolution (SR) reconstruction into the reconstruction of 3D porous media, and put forward the concept of super-dimension (SD) reconstruction: study the correspondence between 2D images and 3D images of the real microscopic pore structure of reservoir rock, and use this correspondence to guide the reconstruction from a new 2D image. Based on this concept, we put forward a new learning-based super-dimension (LBSD) reconstruction algorithm whose basic steps are as follows: (1) select the training set; (2) build the dictionary; (3) reconstruct. Following these steps, we conducted experiments on the reconstruction of porous media from a single 2D image. Comprehensive tests show that the reconstructed 3D structure is consistent with the 3D micro-CT core sample from which the 2D TI was selected, in both morphological and statistical characteristics.
- Research Article
- 10.1002/sdtp.17206
- Apr 1, 2024
- SID Symposium Digest of Technical Papers
This paper proposes a multi-view autostereoscopic content synthesis technique for glasses-free 3D display devices. The technique uses deep learning methods to construct 3D implicit representations that ensure geometric consistency between viewpoints. At inference time, our method takes a single RGB image as input and generates a 3D implicit representation of the input view frustum. When rendering new viewpoints, the amount of parallax of the 3D content, the screen-in/out ratio, and the number of viewpoints can be adjusted by setting the extrinsic parameters of the virtual camera. Thus, it can adapt to glasses-free 3D display devices with different optical designs.
- Research Article
126
- 10.1145/3381866
- Apr 9, 2020
- ACM Transactions on Graphics
We present a deep generative scene modeling technique for indoor environments. Our goal is to train a generative model using a feed-forward neural network that maps a prior distribution (e.g., a normal distribution) to the distribution of primary objects in indoor scenes. We introduce a 3D object arrangement representation that models the locations and orientations of objects, based on their size and shape attributes. Moreover, our scene representation is applicable for 3D objects with different multiplicities (repetition counts), selected from a database. We show a principled way to train this model by combining discriminative losses for both a 3D object arrangement representation and a 2D image-based representation. We demonstrate the effectiveness of our scene representation and the network training method on benchmark datasets. We also show the applications of this generative model in scene interpolation and scene completion.
- Research Article
26
- 10.1016/j.cag.2021.01.002
- Jan 29, 2021
- Computers & Graphics
Animated 3D human avatars from a single image with GAN-based texture inference
- Research Article
2
- 10.1109/tvcg.2024.3363493
- Feb 1, 2025
- IEEE transactions on visualization and computer graphics
Recovering a user-specific and controllable human model from a single RGB image is a nontrivial challenge. Existing methods usually generate static results in which the subject's pose matches the image. Our work aims to achieve pose-controllable human reconstruction from a single image by learning a dynamic (multi-pose) implicit field. We first construct a feature-embedded human model (FEHM) as a bridge to propagate image features to different pose spaces. Based on the FEHM, we then encode three pose-decoupled features. Global image features represent user-specific shapes in images and replace the widely used pixel-aligned features to avoid unwanted shape-pose entanglement. Spatial color features propagate FEHM-embedded image cues into 3D pose space to provide spatial high-frequency guidance. Spatial geometry features improve reconstruction robustness by using the surface shape of the FEHM as a prior. Finally, new implicit functions are designed to predict the dynamic human implicit fields. For effective supervision, a realistic human avatar dataset, SimuSCAN, with 1000+ models is constructed using a low-cost hierarchical mesh registration method. Extensive experiments demonstrate that our method achieves state-of-the-art reconstruction quality.
- Research Article
68
- 10.1109/tci.2020.2981761
- Jan 1, 2020
- IEEE Transactions on Computational Imaging
Depth prediction from a single image is a challenging task due to the intra-scale ambiguity and the unavailability of prior information. Predicting an unambiguous depth from a single RGB image is an important aspect of computer vision applications. In this paper, an end-to-end sparse-to-dense network (S2DNet) is proposed for single image depth estimation (SIDE). The proposed network processes a single image along with additional sparse depth samples for depth estimation. The additional sparse depth samples are either acquired with a low-resolution depth sensor or calculated by visual simultaneous localization and mapping (SLAM) algorithms. In the first stage, the proposed S2DNet estimates a coarse-level depth map using the sparse-to-dense coarse network (S2DCNet). In the second stage, the estimated coarse-level depth map is concatenated with the input image and used as the input to the sparse-to-dense fine network (S2DFNet) for fine-level depth map estimation. The proposed S2DFNet comprises an attention map architecture, which helps to estimate the prominent depth information. The quantitative and qualitative performance of the proposed network has been evaluated using error metrics. We perform a complete evaluation of S2DNet on four publicly available benchmark datasets, i.e., the NYU Depth-V2 indoor dataset [1], the KITTI odometry outdoor dataset [2], the KITTI depth completion test database [3], and the SUN-RGB database [4]. Further, we have extended the proposed S2DNet to image de-hazing. The experimental analysis shows that the proposed S2DNet outperforms the existing state-of-the-art methods for both single image depth estimation and image de-hazing.