A Landmark‐Free 3D–2D Rigid Liver Registration via Point Cloud Matching for Laparoscopic Surgery
ABSTRACT: Real‐time registration of preoperative 3D liver models to intraoperative 2D laparoscopic images is essential for augmented reality navigation in minimally invasive liver surgery. However, 3D–2D registration typically depends on anatomical landmark extraction followed by pose estimation via iterative projection‐based landmark distance minimization, which is time‐consuming. Unlike iterative pose refinement strategies, our method treats liver pose estimation as a partial‐to‐complete point matching problem. First, it leverages monocular depth estimation to reconstruct a partial intraoperative point cloud from a single RGB image. A two‐stage point matching framework then establishes dense 3D–3D correspondences, and the 6‐DoF rigid pose is recovered by solving a weighted SVD over the matched point pairs. Experiments yield a reprojection error of 126.37 ± 48.98 pixels on the P2ILF dataset and a target registration error of 25.20 mm on the LLR‐LUS dataset. These results indicate that our method aligns preoperative models to intraoperative scenes with promising accuracy and efficiency, suggesting its potential for practical rigid alignment in near real‐time laparoscopic liver AR navigation.
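The closed-form step described above, solving a weighted SVD over matched point pairs, is the classical weighted Kabsch/Umeyama alignment. A minimal NumPy sketch of that step, under illustrative names rather than the authors' implementation:

```python
import numpy as np

def weighted_rigid_pose(src, dst, w):
    """src, dst: (N, 3) matched 3D points; w: (N,) positive weights.
    Returns R (3x3), t (3,) minimizing sum_i w_i * ||dst_i - (R @ src_i + t)||^2."""
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst                       # weighted centroids
    H = ((src - mu_s) * w[:, None]).T @ (dst - mu_d)    # weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

# Toy check: recover a known rigid transform from exact correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                                  # ensure a proper rotation
dst = src @ R_true.T + np.array([0.1, -0.2, 0.3])
R, t = weighted_rigid_pose(src, dst, np.ones(100))
assert np.allclose(R, R_true, atol=1e-6)
```

In the partial-to-complete setting the weights would typically encode correspondence confidence, down-weighting unreliable matches from the depth-estimated point cloud.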
- Conference Article
42
- 10.1109/iccvw.2019.00439
- Oct 1, 2019
In contrast to the current literature, we address the problem of estimating the spectrum from a single common trichromatic RGB image obtained under unconstrained settings (e.g. unknown camera parameters, unknown scene radiance, unknown scene contents). For this we use a reference spectrum as provided by a hyperspectral camera, and propose efficient deep learning solutions for sensitivity function estimation and spectral reconstruction from a single RGB image. We further expand the concept of spectral reconstruction so that it works for RGB images taken in the wild, and propose a solution based on a convolutional network conditioned on the estimated sensitivity function. Besides the proposed solutions, we also study generic and sensitivity-specialized models and discuss their limitations. We achieve competitive, state-of-the-art results on the standard example-based spectral reconstruction benchmarks: ICVL, CAVE and NUS. Moreover, our experiments show that, for the first time, accurate spectral estimation from a single RGB image in the wild is within reach.
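For context, the sensitivity function being estimated enters through the standard linear image-formation model: each RGB channel is the scene spectrum integrated against that channel's response curve. A minimal discretized sketch of this forward map, with illustrative names (the paper learns to estimate the sensitivity and invert the map):

```python
import numpy as np

def rgb_from_spectrum(spectrum, sensitivity):
    """spectrum: (B,) scene radiance over B wavelength bins;
    sensitivity: (3, B) per-channel camera response curves."""
    return sensitivity @ spectrum

# Toy usage: 31 bins covering 400-700 nm at 10 nm, random stand-in curves.
rng = np.random.default_rng(0)
S = rng.random((3, 31))          # stand-in sensitivity functions
L = rng.random(31)               # stand-in scene spectrum
print(rgb_from_spectrum(L, S))   # one RGB triple
```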
- Research Article
1
- 10.3390/s24175474
- Aug 23, 2024
- Sensors (Basel, Switzerland)
Accurate 6DoF (degrees of freedom) pose and focal length estimation are important in extended reality (XR) applications, enabling precise object alignment and projection scaling, thereby enhancing user experiences. This study focuses on improving 6DoF pose estimation from single RGB images with unknown camera metadata. Estimating the 6DoF pose and focal length from an uncontrolled RGB image, obtained from the internet, is challenging because it often lacks crucial metadata. Existing methods such as FocalPose and FocalPose++ have made progress in this domain but still face challenges due to the projection scale ambiguity between the translation of an object along the z-axis (tz) and the camera's focal length. To overcome this, we propose a two-stage strategy that decouples the projection scaling ambiguity in the estimation of z-axis translation and focal length. In the first stage, tz is set arbitrarily, and we predict all the other pose parameters and the focal length relative to the fixed tz. In the second stage, we predict the true value of tz while scaling the focal length based on the tz update. The proposed two-stage method reduces projection scale ambiguity in RGB images and improves pose estimation accuracy. Iterative update rules constrained to the first stage, together with tailored loss functions including a Huber loss in the second stage, enhance the accuracy of both 6DoF pose and focal length estimation. Experimental results on benchmark datasets show significant improvements in median rotation and translation errors, as well as better projection accuracy, compared to existing state-of-the-art methods. In an evaluation across the Pix3D datasets (chair, sofa, table, and bed), the proposed two-stage method improves projection accuracy by approximately 7.19%. Additionally, the incorporation of the Huber loss reduced translation and focal length errors by 20.27% and 6.65%, respectively, compared to the FocalPose++ method.
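The scale ambiguity being decoupled is visible directly in the pinhole model: scaling the focal length and the object depth by the same factor leaves the projection unchanged (exactly so for a planar object, approximately under weak perspective). A minimal sketch with illustrative values, showing why tz can be fixed in stage one and the focal length rescaled with the tz update in stage two:

```python
import numpy as np

def project(xy, f, tz):
    """Pinhole projection of planar object points xy (N, 2) at depth tz."""
    return f * xy / tz

pts = np.array([[0.1, 0.2], [-0.3, 0.1]])   # planar object, local z = 0
f1, tz1 = 600.0, 1.0                        # stage 1: tz fixed arbitrarily
f2, tz2 = f1 * 2.5, tz1 * 2.5               # stage 2: true tz, focal rescaled
assert np.allclose(project(pts, f1, tz1), project(pts, f2, tz2))
```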
- Book Chapter
1
- 10.1007/978-3-031-31417-9_2
- Jan 1, 2023
The ability of a robot to sense and “perceive” its surroundings, and to interact with and influence objects of interest by grasping them using vision-based sensors, is the main principle behind vision-based autonomous robotic grasping. To realise this task of autonomous object grasping, one critical sub-task is 6D pose estimation of a known object of interest from sensory data in a given environment. The sensory data can include RGB images and data from depth sensors, but determining the object's pose using only a single RGB image is cost-effective and highly desirable in many applications. In this work, we develop a series of convolutional neural network-based pose estimation models without post-refinement stages, designed to achieve high accuracy on relevant metrics for efficiently estimating the 6D pose of an object using only a single RGB image. The designed models are incorporated into an end-to-end pose estimation pipeline based on Unity and ROS Noetic, where a UR3 robotic arm is deployed in a simulated pick-and-place task. The pose estimation performance of the different models is compared and analysed in both same-environment and cross-environment cases, utilising synthetic RGB data collected from cluttered and simple simulation scenes constructed in the Unity environment. In addition, the developed models achieved high Average Distance (ADD) metric scores, greater than 93% for most of the real-life objects tested in the LINEMOD dataset, and can be integrated seamlessly with any robotic arm for estimating 6D pose from only RGB data, making our method effective, efficient and generic.
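The ADD metric reported above is the standard one: the mean distance between model points transformed by the ground-truth pose and by the estimated pose, with a pose usually counted correct when ADD falls below 10% of the object diameter (the common LINEMOD protocol). A minimal sketch with illustrative names:

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """Mean distance between model points under GT and estimated poses."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def add_correct(model_pts, R_gt, t_gt, R_est, t_est, diameter):
    """Standard LINEMOD criterion: ADD below 10% of the model diameter."""
    return add_metric(model_pts, R_gt, t_gt, R_est, t_est) < 0.1 * diameter
```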
- Research Article
11
- 10.1016/j.neucom.2021.12.013
- Dec 10, 2021
- Neurocomputing
3D interacting hand pose and shape estimation from a single RGB image
- Conference Article
4
- 10.1109/ijcnn48605.2020.9207286
- Jul 1, 2020
The 6D object pose obtained from a single RGB image has broad applications such as robotic manipulation and virtual reality. Among many existing methods, deep learning-based approaches for object pose estimation from a single RGB image are widely used. However, they often require a large amount of training data, which poses great challenges given the high cost of data collection and the lack of 3D information. In this paper, we introduce an object pose estimation architecture that takes a single RGB image as input and directly outputs rotation angles and translation vectors. A data generation pipeline that applies the idea of domain randomization is used to generate millions of low-quality rendered images. Pose estimation is then realized by fusing the architecture with the domain randomization approach to exploit the generated data and lower the data collection cost. We synthesized a large dataset called Pose6DDR whose images are similar to those in the LineMod dataset. Experiments demonstrated the effectiveness of the proposed 6D object pose estimation architecture compared to relevant competing approaches.
- Research Article
5
- 10.1155/2020/8432840
- Jul 15, 2020
- Mathematical Problems in Engineering
3D hand pose estimation can provide basic information about gestures, which is of great significance in the fields of Human-Machine Interaction (HMI) and Virtual Reality (VR). In recent years, 3D hand pose estimation from a single depth image has made great research progress due to the development of depth cameras. However, 3D hand pose estimation from a single RGB image is still a highly challenging problem. In this work, we propose a novel four-stage cascaded hierarchical CNN (4CHNet), which leverages a hierarchical network to decompose hand pose estimation into finger pose estimation and palm pose estimation, extracts finger and palm features separately, and finally fuses them to estimate the 3D hand pose. Compared with direct estimation methods, the hand feature information extracted by the hierarchical network is more representative. Furthermore, concatenating the stages of the network for end-to-end training allows each stage to benefit from and improve the others. Experimental results on two public datasets demonstrate that our 4CHNet can significantly improve the accuracy of 3D hand pose estimation from a single RGB image.
- Research Article
4
- 10.1177/00405175221118105
- Aug 15, 2022
- Textile Research Journal
Hyperspectral images can significantly increase the accuracy of textile color measurement because of their rich information. However, hyperspectral imaging generally requires expensive equipment and complex operations. If hyperspectral information can be reconstructed from a single RGB image, the widespread application of hyperspectral imaging technology, such as in textile color measurement, becomes practical. In this paper, a deep learning model is proposed for hyperspectral reconstruction of cotton and linen fabrics based on the conditional generative adversarial network. In this model, an encoder–decoder structure and a spatial pyramid convolution pooling operation are adopted to fuse multi-scale features and prevent mode collapse. Atrous convolution is introduced to increase the receptive field to adapt to fabric texture information, and the hyperspectral information of the fabric is reconstructed from a single RGB image. Quantitative and qualitative tests verified that the proposed method performs well. The root mean square error and peak signal-to-noise ratio were 0.0271 and 31.372, respectively, for reconstructed fabric hyperspectral images; the highest average color difference in the reconstructed hyperspectral colorimetry experiment was 2.755. Thus, the proposed method can meet common application requirements of color measurement.
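As a sanity check, the reported RMSE and PSNR are mutually consistent under the usual definition PSNR = 20·log10(MAX/RMSE) with intensities normalized so MAX = 1 (per-image averaging means the figures need not match exactly):

```python
import math

rmse = 0.0271
psnr = 20 * math.log10(1.0 / rmse)
print(round(psnr, 2))   # ~31.34 dB, close to the reported 31.372
```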
- Research Article
37
- 10.1016/j.patcog.2022.108762
- Apr 30, 2022
- Pattern Recognition
3D hand pose and shape estimation from RGB images for keypoint-based hand gesture recognition
- Research Article
1
- 10.3390/app13020693
- Jan 4, 2023
- Applied Sciences
Estimating the three-dimensional (3D) pose of real objects using only a single RGB image is an interesting and difficult topic. This study proposes a new pipeline to estimate and represent the pose of an object in an RGB image using only a 4-DoF annotation relative to a matching CAD model. The proposed method retrieves CAD candidates from the ShapeNet dataset and utilizes pose-constrained 2D renderings of the candidates to find the best matching CAD model. The pose estimation pipeline consists of several steps of learned networks followed by image similarity measurements. First, from a single RGB image, the category and the object region are determined and segmented. Second, the 3-DoF rotational pose of the object is estimated by a learned pose-contrast network using only the segmented object region. 2D rendering images of the CAD candidates are then generated based on the rotational pose result. Finally, an image similarity measurement is performed to find the best matching CAD model and to determine the 1-DoF focal length of the camera to align the model with the object. Conventional pose estimation methods employ 9-DoF pose parameters due to the unknown scale of both the image object and the CAD model. However, this study shows that only 4-DoF annotation parameters between the real object and the CAD model are enough to facilitate the projection of the CAD model into RGB space for image-graphic applications such as Extended Reality. In the experiments, the performance of the proposed method is analyzed using ground truth and compared with a triplet-loss learning method.
- Conference Article
6
- 10.1109/iros40897.2019.8968566
- Nov 1, 2019
In this paper, we present a novel method to predict 3D TSDF voxels from a single image for dense 3D reconstruction. 3D reconstruction from RGB images has two inherent problems: scale ambiguity and sparse reconstruction. With the advent of deep learning, depth prediction from a single RGB image has addressed these problems. However, as the predicted depth is typically noisy, de-noising methods such as TSDF fusion should be adapted for accurate scene reconstruction. To integrate the two-step processing of depth prediction and TSDF generation, we design an RGB-to-TSDF network to directly predict 3D TSDF voxels from a single RGB image. The TSDF produced by our network can be generated more efficiently, in terms of both time and accuracy, than a TSDF converted from depth prediction. We also use the predicted TSDF for more accurate and robust camera pose estimation to complete scene reconstruction. The global TSDF is updated from TSDF prediction and pose estimation, and thus a dense isosurface can be extracted. In the experiments, we evaluate our TSDF prediction and camera pose estimation results against the conventional method.
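The TSDF fusion referenced here is conventionally the Curless-Levoy weighted running average per voxel; a minimal sketch of that global update, under illustrative names (the paper's contribution is predicting the TSDF observation directly from RGB rather than converting noisy depth):

```python
import numpy as np

def fuse_tsdf(D, W, d_obs, w_obs):
    """Per-voxel weighted running average of truncated signed distances.
    D, W: global TSDF values and weights; d_obs, w_obs: new observation."""
    D = (W * D + w_obs * d_obs) / (W + w_obs)
    W = W + w_obs
    return D, W

# Toy usage on a 2x2x2 volume, starting from an empty (zero-weight) grid.
D, W = np.zeros((2, 2, 2)), np.zeros((2, 2, 2))
D, W = fuse_tsdf(D, W, np.full((2, 2, 2), 0.3), np.ones((2, 2, 2)))
```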
- Research Article
5
- 10.1109/lsp.2020.3033462
- Jan 1, 2020
- IEEE Signal Processing Letters
This work addresses the challenging problem of estimating full 3D human shape and pose from monocular videos. Since real-world 3D mesh-labeled datasets are limited, most current methods in 3D human shape reconstruction focus only on single RGB images, losing all temporal information. In contrast, we propose temporally refined Graph U-Nets, including an image-level module and a video-level module, to solve this problem. The image-level module is a Graph U-Net for human shape and pose estimation from images, where the Graph Convolutional Neural Network (Graph CNN) helps the information communication of neighboring vertices, and the U-Net architecture enlarges the receptive field of each vertex and fuses high-level and low-level features. The video-level module is a small Residual Temporal Graph CNN (Residual TG-CNN), which learns temporal dynamics from both structural and temporal neighbors. The temporal dynamics of each vertex are continuous in the temporal dimension and highly relevant to the structural neighbors, so fusing temporal dynamics helps diminish the ambiguity of the body in single images. Our algorithm makes full use of labels from image-level datasets and refines the image-level results through the video-level module. Evaluated on the Human3.6M and 3DPW datasets, our model produces accurate 3D human meshes and achieves superior 3D human pose estimation accuracy compared with state-of-the-art methods.
- Conference Article
189
- 10.1109/cvpr.2019.00116
- Jun 1, 2019
Estimating 3D hand meshes from single RGB images is challenging, due to intrinsic 2D-3D mapping ambiguities and limited training data. We adopt a compact parametric 3D hand model that represents deformable and articulated hand meshes. To achieve model fitting to RGB images, we investigate and contribute in three ways: 1) Neural rendering: inspired by recent work on the human body, our hand mesh estimator (HME) is implemented by a neural network and a differentiable renderer, supervised by 2D segmentation masks and 3D skeletons. HME demonstrates good performance for estimating diverse hand shapes and improves pose estimation accuracy. 2) Iterative testing refinement: our fitting function is differentiable, so we iteratively refine the initial estimate using its gradients, in the spirit of iterative model fitting methods like ICP; the idea is supported by the latest research on the human body. 3) Self-data augmentation: collecting RGB-mesh (or segmentation mask)-skeleton triplets for training is a big hurdle. Once the model is successfully fitted to input RGB images, its meshes, i.e. shapes and articulations, are realistic, and we augment viewpoints on top of the estimated dense hand poses. Experiments using three RGB-based benchmarks show that our framework offers beyond state-of-the-art accuracy in 3D pose estimation, as well as recovering dense 3D hand shapes. Each technical component above meaningfully improves accuracy in the ablation study.
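Point 2 above, test-time refinement through a differentiable fitting function, amounts to gradient descent on the fitting loss from the network's initial estimate. A generic PyTorch-style sketch under assumed names (the toy loss stands in for the mask/skeleton reprojection losses):

```python
import torch

def refine(params, loss_fn, steps=200, lr=0.1):
    """Iteratively refine an initial estimate by descending a differentiable loss."""
    params = params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(params).backward()
        opt.step()
    return params.detach()

# Toy usage: pull an initial estimate toward a target (stand-in loss).
target = torch.tensor([1.0, 2.0, 3.0])
refined = refine(torch.zeros(3), lambda p: ((p - target) ** 2).sum())
```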
- Research Article
19
- 10.3390/app13137611
- Jun 27, 2023
- Applied Sciences
Human pose estimation refers to accurately estimating the positions of human body keypoints from a single RGB image and detecting the location of the body. It serves as the basis for several computer vision tasks, such as human tracking, 3D reconstruction, and autonomous driving. Improving the accuracy of pose estimation has significant implications for the advancement of computer vision. This paper addresses the limitations of single-branch networks in pose estimation. It presents a top-down single-target pose estimation approach based on multi-branch self-calibrating networks combined with graph convolutional neural networks. The study focuses on two aspects: human body detection and human body pose estimation. Human body detection targets athletes appearing in sports competitions and is followed by human body pose estimation, for which two methods are considered: coordinate regression-based and heatmap-based. To improve heatmap accuracy, the high-resolution feature maps output by HRNet are deconvolved, improving single-target pose estimation accuracy.
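In the heatmap-based branch mentioned above, each joint's 2D location is conventionally decoded as the argmax of its predicted heatmap; a minimal sketch with illustrative names:

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """heatmaps: (K, H, W) per-joint score maps -> (K, 2) (x, y) pixel coords."""
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)
    return np.stack([flat_idx % W, flat_idx // W], axis=1)

# Toy usage: one joint whose peak sits at (x=3, y=1) in a 4x5 map.
hm = np.zeros((1, 4, 5))
hm[0, 1, 3] = 1.0
print(decode_heatmaps(hm))   # [[3 1]]
```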
- Research Article
4
- 10.1016/j.cag.2022.11.010
- Nov 25, 2022
- Computers & Graphics
Robust and automatic clothing reconstruction based on a single RGB image
- Research Article
31
- 10.1109/tcsvt.2020.3004453
- Jun 23, 2020
- IEEE Transactions on Circuits and Systems for Video Technology
Hand pose estimation in 3D space from a single RGB image is a highly challenging problem due to self-geometric ambiguities, diverse textures, viewpoints, and self-occlusions. Existing work shows that a network structure with multi-scale resolution subnets fused in parallel can more effectively preserve the spatial accuracy of 2D pose estimation. Nevertheless, the features extracted by traditional convolutional neural networks cannot efficiently express the unique topological structure of hand keypoints, which are discrete yet correlated. Applications of hand pose estimation based on traditional convolutional neural networks have demonstrated that the structural similarity between graphs and hand keypoints can improve the accuracy of 3D hand pose regression. In this paper, we design and implement an end-to-end network for predicting 3D hand pose from a single RGB image. We first extract multiple feature maps at different resolutions and fuse them in parallel, and then model a graph-based convolutional neural network module to predict the initial 3D hand keypoints. Next, we use 2D spatial relationships and 3D geometric knowledge to build a self-supervised module that eliminates the domain gap between 2D and 3D space. Finally, the final 3D hand pose is calculated by averaging the 3D hand poses from the GCN output and the self-supervised module output. We evaluate the proposed method on two challenging benchmark datasets for 3D hand pose estimation. Experimental results show the effectiveness of our proposed method, which achieves state-of-the-art performance on the benchmark datasets.