Hierarchical and Progressive Image Matting
Most matting research resorts to advanced semantics to achieve high-quality alpha mattes, and a direct low-level features combination is usually explored to complement alpha details. However, we argue that appearance-agnostic integration can only provide biased foreground (FG) details and that alpha mattes require different-level feature aggregation for better pixel-wise opacity perception. In this article, we propose an end-to-end hierarchical and progressive attention matting network (HAttMatting++), which can better predict the opacity of the FG from single RGB images without additional input. Specifically, we utilize channel-wise attention (CA) to distill pyramidal features and employ spatial attention (SA) at different levels to filter appearance cues. This progressive attention mechanism can estimate alpha mattes from adaptive semantics and semantics-indicated boundaries. We also introduce a hybrid loss function fusing structural similarity, mean square error, adversarial loss, and sentry supervision to guide the network to further improve the overall FG structure. In addition, we construct a large-scale and challenging image matting dataset comprised of 59,600 training images and 1,000 test images (a total of 646 distinct FG alpha mattes), which can further improve the robustness of our hierarchical and progressive aggregation model. Extensive experiments demonstrate that the proposed HAttMatting++ can capture sophisticated FG structures and achieve state-of-the-art performance with single RGB images as input.
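For context, the opacity that a matting network predicts is the alpha term of the standard compositing model, I = alpha * F + (1 - alpha) * B, where F and B are the foreground and background colours. The sketch below only illustrates that model on a single pixel; it is not taken from the paper, and the function name `composite` is hypothetical.

```python
def composite(alpha, fg, bg):
    """Blend one RGB foreground pixel over a background pixel.

    alpha is the opacity in [0, 1]; fg and bg are (r, g, b) tuples in [0, 1].
    Implements I = alpha * F + (1 - alpha) * B per channel.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# A half-transparent red foreground pixel over a blue background pixel.
pixel = composite(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

Matting inverts this relation: given only the composite image, the network has to recover alpha for every pixel.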
- Conference Article
186
- 10.1109/cvpr42600.2020.01369
- Jun 1, 2020
Existing deep-learning-based matting algorithms primarily resort to high-level semantic features to improve the overall structure of alpha mattes. However, we argue that advanced semantics extracted from CNNs contribute unequally to alpha perception and that we should reconcile advanced semantic information with low-level appearance cues to refine the foreground details. In this paper, we propose an end-to-end Hierarchical Attention Matting Network (HAttMatting), which can predict better-structured alpha mattes from single RGB images without additional input. Specifically, we employ spatial and channel-wise attention to integrate appearance cues and pyramidal features in a novel fashion. This blended attention mechanism can perceive alpha mattes from refined boundaries and adaptive semantics. We also introduce a hybrid loss function fusing Structural SIMilarity (SSIM), Mean Square Error (MSE) and adversarial loss to guide the network to further improve the overall foreground structure. In addition, we construct a large-scale image matting dataset comprised of 59,600 training images and 1,000 test images (646 distinct foreground alpha mattes in total), which can further improve the robustness of our hierarchical structure aggregation model. Extensive experiments demonstrate that the proposed HAttMatting can capture sophisticated foreground structure and achieve state-of-the-art performance with single RGB images as input.
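The abstract names the ingredients of the hybrid loss but not how they are combined. As a rough sketch only, the MSE and SSIM terms could be summed as below. The following are assumptions, not the paper's formulation: the weights `w_mse` and `w_ssim`, the use of a single global SSIM instead of a windowed one, the flat-list representation of an alpha matte, and the omission of the adversarial term.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists of alpha values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def global_ssim(pred, target, data_range=1.0):
    """SSIM computed once over the whole signal (no sliding window)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    c1 = (0.01 * data_range) ** 2  # usual SSIM stabilising constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


def hybrid_loss(pred, target, w_mse=1.0, w_ssim=1.0):
    """Hypothetical weighted sum of MSE and (1 - SSIM); adversarial term omitted."""
    return w_mse * mse(pred, target) + w_ssim * (1.0 - global_ssim(pred, target))


target = [0.0, 0.2, 0.8, 1.0]
pred = [0.1, 0.2, 0.7, 1.0]
loss = hybrid_loss(pred, target)
```

The MSE term penalises per-pixel opacity errors, while the SSIM term rewards agreement in overall structure, which matches the roles the abstract assigns to them.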
- Conference Article
48
- 10.1109/iccvw.2019.00439
- Oct 1, 2019
In contrast to the current literature, we address the problem of estimating the spectrum from a single common trichromatic RGB image obtained under unconstrained settings (e.g. unknown camera parameters, unknown scene radiance, unknown scene contents). For this we use a reference spectrum as provided by a hyperspectral image camera, and propose efficient deep learning solutions for sensitivity function estimation and spectral reconstruction from a single RGB image. We further expand the concept of spectral reconstruction so that it works for RGB images taken in the wild and propose a solution based on a convolutional network conditioned on the estimated sensitivity function. Besides the proposed solutions, we also study generic and sensitivity-specialized models and discuss their limitations. We achieve state-of-the-art competitive results on the standard example-based spectral reconstruction benchmarks: ICVL, CAVE and NUS. Moreover, our experiments show that, for the first time, accurate spectral estimation from a single RGB image in the wild is within our reach.
- Research Article
5
- 10.1177/00405175221118105
- Aug 15, 2022
- Textile Research Journal
Hyperspectral images are capable of significantly increasing the accuracy of textile color measurement because of their rich information. However, hyperspectral imaging generally requires expensive equipment and complex operations. If the hyperspectral information can be reconstructed based on a single RGB image, it can facilitate the widespread application of hyperspectral imaging technology, such as in textile color measurement. In this paper, a deep learning model was proposed for hyperspectral reconstruction of cotton and linen fabrics based on the conditional generative adversarial network. According to this model, the encoder-decoder structure and spatial pyramid convolution pooling operation were adopted to fuse multi-scale features for the prevention of mode collapse. Atrous convolution was introduced to increase the receptive field to adapt to the fabric texture information, and the hyperspectral information of the fabric from a single RGB image was reconstructed. The quantitative and qualitative tests verified that the method in this paper had good results. The root mean square error and peak signal-to-noise ratio were 0.0271 and 31.372, respectively, for reconstructed fabric hyperspectral images; the highest average color difference in the reconstructed hyperspectral colorimetry experiment was obtained as 2.755. Thus, the proposed method can meet the common application requirements of color measurement.
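The two reported error measures are related by the usual definition PSNR = 20 * log10(peak / RMSE). The sketch below shows that relation. It assumes values scaled to [0, 1] (so `peak=1.0`), which the abstract does not state, and the function names are hypothetical.

```python
import math


def rmse(pred, target):
    """Root mean square error between two equally long lists of values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def psnr_from_rmse(rmse_value, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given RMSE and peak value."""
    if rmse_value == 0:
        return math.inf  # identical signals
    return 20.0 * math.log10(peak / rmse_value)


# Under the [0, 1] scaling assumption, an RMSE of 0.0271 maps to about 31.34 dB.
approx_db = psnr_from_rmse(0.0271)
```

That figure is close to the reported 31.372, though the two need not match exactly if the paper averages PSNR per image or per band.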
- Research Article
5
- 10.1016/j.cag.2022.11.010
- Nov 25, 2022
- Computers & Graphics
Robust and automatic clothing reconstruction based on a single RGB image
- Conference Article
179
- 10.1109/cvpr.2019.00765
- Jun 1, 2019
This paper studies the structure of a deep convolutional neural network to predict the foreground alpha matte by taking a single RGB image as input. Our network is fully convolutional with two decoder branches for the foreground and background classification respectively. Then a fusion branch is used to integrate the two classification results which gives rise to alpha values as the soft segmentation result. This design provides more degrees of freedom than a single decoder branch for the network to obtain better alpha values during training. The network can implicitly produce trimaps without user interaction, which is easy to use for novices without expertise in digital matting. Experimental results demonstrate that our network can achieve high-quality alpha mattes for various types of objects and outperform the state-of-the-art CNN-based image matting methods on the human image matting task.
- Research Article
57
- 10.3390/rs12193258
- Oct 7, 2020
- Remote Sensing
Hyperspectral imaging has many applications. However, the high device costs and low hyperspectral image resolution are major obstacles limiting its wider application in agriculture and other fields. Hyperspectral image reconstruction from a single RGB image fully addresses these two problems. The robust HSCNN-R model with mean relative absolute error loss function and evaluated by the Mean Relative Absolute Error metric was selected through permutation tests from models with combinations of loss functions and evaluation metrics, using tomato as a case study. Hyperspectral images were subsequently reconstructed from single tomato RGB images taken by a smartphone camera. The reconstructed images were used to predict tomato quality properties such as the ratio of soluble solid content to total titratable acidity and normalized anthocyanin index. Both predicted parameters showed very good agreement with corresponding “ground truth” values and high significance in an F test. This study showed the suitability of hyperspectral image reconstruction from single RGB images for fruit quality control purposes, underpinning the potential of the technology, recovering hyperspectral properties in high resolution, for real-world, real-time monitoring applications in agriculture and beyond.
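The mean relative absolute error mentioned above is commonly defined as the average of |prediction - truth| / truth over all values. A minimal sketch follows. The small `eps` guard against division by zero and the flat-list input are assumptions for illustration, not details from the study.

```python
def mrae(pred, target, eps=1e-8):
    """Mean relative absolute error between predicted and reference values.

    eps (an assumption of this sketch) avoids division by zero where the
    reference value is exactly 0.
    """
    return sum(abs(p - t) / (abs(t) + eps) for p, t in zip(pred, target)) / len(pred)


# One value is 10% off and the other is exact, so the result is about 0.05.
error = mrae([1.1, 2.0], [1.0, 2.0])
```

Because each error is divided by the reference value, dark spectral bands count as much as bright ones, unlike with a plain RMSE.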
- Book Chapter
4
- 10.1007/978-3-030-01424-7_31
- Jan 1, 2018
In this work, we propose a new approach for 3D human pose estimation from a single monocular RGB image based on a deep convolutional neural network (CNN). The proposed method reduces the huge search space of continuous-valued 3D human poses by discretizing and approximating these continuous poses into many discrete key-poses. These key-poses constitute a more restricted search space and can then be treated as multi-class candidates for 3D human poses.
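The discretization step can be pictured as assigning each continuous pose to its closest key-pose, which turns regression into classification. The sketch below assumes the key-poses are already given (the abstract does not say how they are obtained) and represents a pose as a flat list of joint coordinates. All names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two flattened poses."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_key_pose(pose, key_poses):
    """Return the index of the key-pose closest to the given continuous pose."""
    return min(range(len(key_poses)),
               key=lambda i: squared_distance(pose, key_poses[i]))


# Two toy key-poses with one 3D joint each; the query is closer to the second.
key_poses = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
label = nearest_key_pose([0.9, 1.2, 0.8], key_poses)
```

The resulting index can serve as a class label for training a CNN classifier, at the cost of a quantization error that shrinks as more key-poses are used.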
- Conference Article
3
- 10.1109/iros40897.2019.8967616
- Nov 1, 2019
In order to operate autonomously, a robot should explore the environment and build a model of each of the surrounding objects. A common approach is to carefully scan the whole workspace. This is time-consuming. It is also often impossible to reach all the viewpoints required to acquire full knowledge about the environment. Humans can perform shape completion of occluded objects by relying on past experience. Therefore, we propose a method that generates images of an object from various viewpoints using a single input RGB image. A deep neural network is trained to imagine the object appearance from many viewpoints. We present the whole pipeline, which takes a single RGB image as input and returns a sequence of RGB and depth images of the object. The method utilizes a CNN-based object detector to extract the object from the natural scene. Then, the proposed network generates a set of RGB and depth images. We show the results both on a synthetic dataset and on real images.
- Conference Article
425
- 10.1109/iccv.2019.00783
- Oct 1, 2019
We propose DeepHuman, an image-guided volume-to-volume translation CNN for 3D human reconstruction from a single RGB image. To reduce the ambiguities associated with the surface geometry reconstruction, even for the reconstruction of invisible areas, we propose and leverage a dense semantic representation generated from SMPL model as an additional input. One key feature of our network is that it fuses different scales of image features into the 3D space through volumetric feature transformation, which helps to recover accurate surface geometry. The visible surface details are further refined through a normal refinement network, which can be concatenated with the volume generation network using our proposed volumetric normal projection layer. We also contribute THuman, a 3D real-world human model dataset containing about 7000 models. The network is trained using training data generated from the dataset. Overall, due to the specific design of our network and the diversity in our dataset, our method enables 3D human model estimation given only a single image and outperforms state-of-the-art approaches.
- Conference Article
1
- 10.1117/12.2623417
- Feb 16, 2022
Although 3D human body reconstruction from a single image has progressed rapidly in recent years, most methods target the body without the hands and face. However, hand gestures and facial expressions are also important for conveying human intentions and emotions. This paper proposes a method for holistic 3D reconstruction of the human body from a single RGB image, including hands, body, and face. Our approach is based on SMPL eXpressive (SMPL-X), a unified 3D parametric human body model of body, hands, and face. Since it is difficult to regress the model parameters of different body parts accurately with a single framework, we use a divide-and-conquer strategy for whole-body reconstruction. We use separate deep neural networks to predict the hand, body, and head model parameters, then integrate them into a single 3D model to realize a holistic and expressive 3D human body reconstruction. Simulation results demonstrate that our method achieves state-of-the-art performance with better facial expression.
- Conference Article
6
- 10.1109/ijcnn48605.2020.9207286
- Jul 1, 2020
The 6D object pose obtained from a single RGB image has broad applications such as robotic manipulation and virtual reality. Among the many existing methods, deep learning-based approaches for object pose estimation from a single RGB image are widely used. However, they often require a large amount of training data, which poses challenges such as the high cost of data collection and the lack of 3D information. In this paper, we introduce an object pose estimation architecture that takes a single RGB image as input and directly outputs rotation angles and translation vectors. A data generation pipeline that applies the idea of domain randomization is used to generate millions of low-quality rendered images. The pose estimation is then realized by fusing the architecture and the domain randomization approach to utilize the generated information and lower the data collection cost. We synthesized a large dataset called Pose6DDR whose images are similar to those in the LineMod dataset. Experiments demonstrated the effectiveness of the proposed 6D object pose estimation architecture as compared to the relevant competing technologies.
- Research Article
- 10.1109/lra.2025.3577460
- Jul 1, 2025
- IEEE Robotics and Automation Letters
Monocular RGB-based category-level object pose estimation is more practical and cost-effective for robotics. However, existing methods do not fully exploit the rich semantic and contextual information in multimodal data (e.g. language) that provides additional object attributes to guide the model in extracting category features more reliably. We propose a language-guided category-level object pose estimation method (LanCOPE), taking a single RGB image as input. Our method uses DINOv2 to recover depth from a single RGB image and converts it into a point cloud to perceive the object's geometry. We then introduce language descriptions for the RGB image, estimated point cloud and overall scene to better guide the point cloud encoder and image encoder in learning category features. We develop a cross-modal differential perception feature fusion network to fuse multimodal features. This network employs a differential perception module to eliminate redundant information across different modalities, highlighting significant semantic differences and similarities. Furthermore, it uses a cross-attention mechanism to fuse the semantic information of the language and vision features, improving the overall perception. Finally, we design a denoising network based on the skip fusion transformer to recover the object pose accurately. Extensive experiments on REAL275 and Wild6D datasets show that LanCOPE achieves state-of-the-art performance. Our code is available at LanCOPE.
- Conference Article
255
- 10.1109/cvpr.2018.00273
- Jun 1, 2018
This paper proposes a deep neural network (DNN) for piece-wise planar depthmap reconstruction from a single RGB image. While DNNs have brought remarkable progress to single-image depth prediction, piece-wise planar depthmap reconstruction requires a structured geometry representation, and has been a difficult task to master even for DNNs. The proposed end-to-end DNN learns to directly infer a set of plane parameters and corresponding plane segmentation masks from a single RGB image. We have generated more than 50,000 piece-wise planar depthmaps for training and testing from ScanNet, a large-scale RGBD video database. Our qualitative and quantitative evaluations demonstrate that the proposed approach outperforms baseline methods in terms of both plane segmentation and depth estimation accuracy. To the best of our knowledge, this paper presents the first end-to-end neural architecture for piece-wise planar reconstruction from a single RGB image. Code and data are available at https://github.com/art-programmer/PlaneNet.
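To illustrate how plane parameters and segmentation masks determine a depthmap, the sketch below renders depth under a pinhole camera model. It assumes each plane is stored as a normal `n` and offset `d` with points satisfying n . X = d, and that the label -1 marks non-planar pixels. These conventions and all names are hypothetical and may differ from those in the released code.

```python
def plane_depth(u, v, normal, offset, fx, fy, cx, cy):
    """Depth at pixel (u, v) of the plane n . X = d under a pinhole camera."""
    # Viewing ray through the pixel, scaled so that its z component is 1.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < 1e-9:
        return None  # ray is parallel to the plane
    return offset / denom


def render_depth(planes, segmentation, fx, fy, cx, cy):
    """Build a depthmap from per-pixel plane indices (-1 means non-planar)."""
    depth = []
    for v, row in enumerate(segmentation):
        depth_row = []
        for u, index in enumerate(row):
            if index < 0:
                depth_row.append(None)  # no plane assigned to this pixel
            else:
                normal, offset = planes[index]
                depth_row.append(plane_depth(u, v, normal, offset, fx, fy, cx, cy))
        depth.append(depth_row)
    return depth


# One fronto-parallel plane two units in front of the camera.
planes = [((0.0, 0.0, 1.0), 2.0)]
segmentation = [[0, 0], [0, -1]]
depth = render_depth(planes, segmentation, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

A handful of plane parameters plus a mask describe the geometry far more compactly than a per-pixel depthmap, which is the structured representation the abstract refers to.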
- Research Article
- 10.1145/3734873
- May 8, 2025
- ACM Transactions on Multimedia Computing, Communications, and Applications
Reconstructing a 3D hand from a single RGB image is a very challenging task. Most existing Transformer-based 3D hand reconstruction methods do not fully consider the local spatial information in low-level image features, which is crucial for capturing the fine details and accurate shape of the hand. Consequently, the reconstructed hands often lack the precision and realism necessary for many applications, such as augmented reality and hand gesture recognition. To address this limitation, in this paper, we propose a novel and efficient method named HybridMETRO that utilizes both low-level and high-level image features to accurately reconstruct 3D hand pose and mesh vertices from a single RGB image. Specifically, we introduce deformable attention into the encoder of the Transformer, making it no longer limited by the length of the image feature sequence. Based on the above mechanism, we further propose an interleaved updating multi-scale feature encoder to fuse low-level and high-level features. Moreover, we incorporate the Graph Convolutional Residual (GCR) module to build a novel decoder that captures explicit semantic connections between grid vertices and thus improves the spatial locality of extracted features. Experimental results demonstrate that, compared with state-of-the-art methods, our proposed HybridMETRO achieves better performance with significantly fewer model parameters, about half of METRO’s and a quarter of HandOccNet’s.
- Research Article
16
- 10.1109/access.2021.3060435
- Jan 1, 2021
- IEEE Access
Estimating the depth map from a single RGB image is important for understanding the nature of the terrain in robot navigation and has attracted considerable attention in the past decade. Existing approaches can accurately estimate the depth from a single RGB image in a highly structured environment. The problem becomes more challenging when the terrain is highly dynamic. We propose a fine-tuned generative adversarial network to estimate the depth map effectively for a given single RGB image. The proposed network is composed of a fine-tuned generator and a global discriminator. The encoder part of the generator takes input RGB images and depth maps and generates their joint distribution in the latent space. Subsequently, the decoder part of the generator decodes the depth map from the joint distribution. The discriminator takes real and fake pairs in three different configurations and then guides the generator to estimate the depth map from the given RGB image accordingly. Finally, we conducted extensive experiments with a highly dynamic environment dataset to verify the effectiveness and feasibility of the proposed approach. The proposed approach could decode the depth map from the joint distribution more effectively and accurately than the existing approaches.