A 3D object detection and pose estimation pipeline using RGB-D images
3D object detection and pose estimation have been studied extensively in recent decades for their potential applications in robotics. However, challenges remain when the goal is to detect multiple objects while keeping the false positive rate low in cluttered environments. This paper proposes a robust 3D object detection and pose estimation pipeline based on RGB-D images, which can detect multiple objects simultaneously while reducing false positives. Detection begins with template matching and yields a set of template matches. A clustering algorithm then groups templates of similar spatial location and produces multiple-object hypotheses. A scoring function evaluates the hypotheses using their associated templates, and non-maximum suppression is adopted to remove duplicate results based on the scores. Finally, a combination of point cloud processing algorithms is used to compute the objects' 3D poses. Existing object hypotheses are verified by computing the overlap between model and scene points. Experiments demonstrate that our approach provides results competitive with the state of the art and can be applied to robot random bin-picking.
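The non-maximum suppression step described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the scoring function or the duplicate criterion, so the sketch assumes each hypothesis carries a scalar score and a 3D center, and treats two hypotheses as duplicates when their centers are closer than a hypothetical `min_distance` threshold.

```python
import math

def non_max_suppression(hypotheses, min_distance):
    """Greedy NMS over scored object hypotheses.

    hypotheses: list of (score, (x, y, z)) tuples; a higher score is better.
    min_distance: assumed duplicate radius in metres (illustrative, not from the paper).
    """
    kept = []
    # Visit hypotheses from best to worst score.
    for score, center in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        # Keep a hypothesis only if no better one has already claimed its location.
        if all(math.dist(center, c) >= min_distance for _, c in kept):
            kept.append((score, center))
    return kept

# Made-up hypotheses for illustration only.
hypotheses = [
    (0.91, (0.10, 0.20, 0.80)),
    (0.85, (0.11, 0.21, 0.80)),  # near-duplicate of the first hypothesis
    (0.78, (0.40, 0.05, 0.75)),  # a second, separate object
]
detections = non_max_suppression(hypotheses, min_distance=0.05)
```

With these values the second hypothesis is suppressed by the first, and two detections remain. A real pipeline would likely compare overlap of templates or bounding volumes rather than center distance alone.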
- Conference Article
5
- 10.1109/etfa.2019.8869384
- Sep 1, 2019
To improve a robot's perception in complicated environments, especially unstructured ones, a framework for 3D object detection and pose estimation using a single shot detector (SSD) and modified LineMOD template matching is proposed, which can detect multiple objects and estimate their poses simultaneously. First, the initial object detection (the first detection) is performed by the single shot detector network, which generates the regions of interest (RoIs) of the target objects. LineMOD template matching is then applied to provide candidate templates. These templates are grouped by the designed clustering algorithm. After the clusters are sorted in descending order of average similarity, non-maximum suppression removes similar results and provides the further multiple detection results (the second detection). Finally, based on the results of the second detection, the pose of each object is estimated using the iterative closest point (ICP) algorithm. The object detection experiments show that on the Tejani dataset, the average recognition rate over six objects reaches 99.25%. For object pose estimation, the F1 score of the proposed method is 21.7% higher than that of the conventional method. The F1 score of the presented algorithm is also 9.5% higher than that of the Deep-6Dpose method. Both comparison experiments verify the effectiveness of the proposed framework. Further, this framework is employed for robotic grasping. In particular, steel plate workpieces are grasped, which is a necessary step in the polishing process.
- Conference Article
5
- 10.1109/yac.2017.7967493
- May 1, 2017
Robust 3D object detection and pose estimation is still a big challenge for robot vision. In this paper, we propose a new framework for 3D object detection and pose estimation. Rather than using RGB-D images as the original data, we propose to use a volumetric representation, with the help of an unsupervised deep learning network, to extract low-dimensional features directly from the 3D point cloud. The volumetric representation not only eliminates the dense scale sampling needed for offline model training, but also reduces the distortion caused by mapping the 3D shape to a 2D plane and overcomes the dependence on texture information. Using a Hough forest, we achieve multi-object detection and pose estimation simultaneously. A comparison with state-of-the-art methods on public datasets demonstrates the effectiveness of our proposed method.
- Conference Article
2
- 10.1145/1968613.1968658
- Feb 21, 2011
The next generation of service robots is expected to offer services that rely heavily on visually guided manipulation, in addition to navigation, such as errand, logistics, appliance, and home-keeping services. For the successful introduction of these services, it is critical to establish consumer-level dependability in 3D object recognition and pose estimation in natural settings, where large variations in the environment (e.g., perspective, texture, form factor, illumination, and occlusion) are common. To address this problem, we propose a two-layered particle filter approach to dependable 3D object recognition and pose estimation. In the upper layer, a set of object pose candidates is identified and maintained in the search space as a set of super-particles, each of which is assigned a probability of being the true pose and evolves over time as further evidence accumulates. To define the object pose candidates, we first quickly acquire initial weak evidence and interpret it in terms of possible object poses in space. These interpretations serve as regions of interest for detailed investigation, in which the pose probabilities are computed for individual interpretations based on the likelihood and unlikelihood of various features available in the corresponding regions of interest. During the probability computation, we select the object pose candidates to be used as super-particles in the upper layer. In the lower layer, the pose uncertainties associated with the individual candidates are represented as particles that are propagated over time. Finally, the experimental results support the strength of the proposed approach in real environments in terms of its dependability in 3D object recognition and pose estimation.
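The idea of pose candidates whose probabilities evolve as evidence accumulates can be sketched as a plain Bayesian reweighting step. This is not the two-layered filter from the abstract: it omits resampling, propagation, and the lower particle layer, and the likelihood values are made up for illustration.

```python
def update_weights(weights, likelihoods):
    """One Bayesian update: multiply prior weights by likelihoods and renormalize.

    weights: current probability of each pose candidate (sums to 1).
    likelihoods: assumed likelihood of the new evidence under each candidate.
    """
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    if total == 0.0:
        # Evidence is incompatible with every candidate; keep the prior unchanged.
        return list(weights)
    return [p / total for p in posterior]

# Three hypothetical pose candidates, initially equally likely.
weights = [1 / 3, 1 / 3, 1 / 3]

# Two rounds of hypothetical evidence, each favouring the first candidate.
for likelihoods in ([0.6, 0.3, 0.1], [0.7, 0.2, 0.1]):
    weights = update_weights(weights, likelihoods)
```

After both rounds the first candidate holds most of the probability mass, which is the behaviour the abstract describes for super-particles as evidence accumulates.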
- Research Article
22
- 10.1109/tcsvt.2019.2929600
- Jul 30, 2019
- IEEE Transactions on Circuits and Systems for Video Technology
In this paper, a framework is proposed for object recognition and pose estimation from color images using convolutional neural networks (CNNs). 3D object pose estimation along with object recognition has numerous applications, such as robot positioning versus a target object and robotic object grasping. Previous methods addressing this problem relied on both color and depth (RGB-D) images to learn low-dimensional viewpoint descriptors for object pose retrieval. In the proposed method, a novel quaternion-based multi-objective loss function is used, which combines manifold learning and regression to learn 3D pose descriptors and direct 3D object pose estimation, using only color (RGB) images. The 3D object pose can then be obtained either by using the learned descriptors in the nearest neighbor (NN) search or by direct neural network regression. An extensive experimental evaluation has proven that such descriptors provide greater pose estimation accuracy than the state-of-the-art methods. In addition, the learned 3D pose descriptors are almost object-independent and, thus, generalizable to unseen objects. Finally, when the object identity is not of interest, the 3D object pose can be regressed directly from the network, by overriding the NN search, thus significantly reducing the object pose inference time.
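The abstract above does not give the form of its quaternion-based loss. As a related illustration only, the sketch below computes the standard angular distance between two orientations expressed as unit quaternions, a quantity commonly used to measure pose error; the `(w, x, y, z)` ordering is an assumption of this example.

```python
import math

def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def quaternion_angle(q1, q2):
    """Rotation angle in radians between two orientations given as quaternions."""
    q1, q2 = normalize(q1), normalize(q2)
    # abs() because q and -q represent the same rotation.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    # Clamp to guard against rounding slightly above 1.
    return 2.0 * math.acos(min(1.0, dot))

identity = (1.0, 0.0, 0.0, 0.0)
# A 90-degree rotation about the z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = quaternion_angle(identity, quarter_turn_z)
```

Here `angle` equals a quarter turn up to floating-point rounding. Taking the absolute value of the dot product is what makes the measure insensitive to the sign ambiguity of quaternions.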
- Book Chapter
3
- 10.1007/978-3-319-65292-4_4
- Jan 1, 2017
3D object detection and pose estimation based on 3D sensors have been widely studied for their applications in robotics. In this paper, we propose a new clustering strategy for the Point Pair Feature (PPF) based 3D object detection and pose estimation framework to further improve the pose hypothesis results. Our main contribution is the use of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Principal Component Analysis (PCA) in the PPF method. It was recently shown that point pair features combined with a voting framework can obtain fast and robust pose estimation results in heavily cluttered scenes with occlusions. However, this method may fail in mismatching regions caused by false features or features with insufficient information. Our experimental results show that the proposed method can detect mismatching regions and false pose hypotheses in the PPF method, which improves performance in robot bin-picking applications.
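A minimal sketch of DBSCAN follows, to show how density-based clustering can separate consistent pose hypotheses from isolated ones. The abstract does not say which quantities are clustered, so this example assumes the inputs are hypothesis translations; `eps` and `min_pts` are illustrative values, and the PCA step is not shown.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

# Hypothetical translations of pose hypotheses (metres).
translations = [
    (0.00, 0.00, 0.00), (0.01, 0.00, 0.00), (0.00, 0.01, 0.00),
    (0.50, 0.50, 0.50), (0.51, 0.50, 0.50), (0.50, 0.51, 0.50),
    (2.00, 2.00, 2.00),  # isolated hypothesis, expected to be labeled noise
]
labels = dbscan(translations, eps=0.05, min_pts=3)
```

The two tight groups become clusters 0 and 1, and the isolated hypothesis is labeled -1, which is the kind of outlier a verification stage could then discard.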
- Conference Article
996
- 10.1109/wacv.2014.6836101
- Mar 1, 2014
3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small amount of images per category or are captured in controlled environments. In this paper, we contribute PASCAL3D+ dataset, which is a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines for the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d.
- Research Article
54
- 10.1109/access.2018.2808225
- Jan 1, 2018
- IEEE Access
Object recognition and pose estimation are essential functions in computer vision applications and fundamental modules in robotic vision systems. In recent years, RGB-D cameras have become increasingly popular, and 3D object recognition technology has received growing attention. In this paper, a novel algorithm for simultaneous 3D object recognition and pose estimation based on RGB-D images is proposed. The proposed system converts the input RGB-D image to colored point cloud data and extracts features of the scene from the colored point cloud. Then, the existing color signature of histograms of orientations (CSHOT) description algorithm is employed to build descriptors of the detected features based on local texture and shape information. Given the extracted feature descriptors, a two-stage matching process is performed to find correspondences between the scene and a colored point cloud model of an object. Next, a Hough voting algorithm is used to filter out matching errors in the correspondence set and estimate the initial 3D pose of the object. Finally, the pose estimation stage employs RANdom SAmple Consensus (RANSAC) and hypothesis verification algorithms to refine the initial pose and filter out poor estimates arising from erroneous hypotheses. Experimental results show that the proposed system not only successfully recognizes the object in a complex scene but also accurately estimates the 3D pose of the object with respect to the camera.
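The Hough voting step mentioned above can be sketched in a simplified, translation-only form. The sketch assumes each correspondence has already been converted into a vote for the object's reference point, and it uses a fixed grid with a hypothetical `bin_size`; the actual algorithm in the paper may vote in a different parameter space.

```python
import math
from collections import defaultdict

def hough_vote(votes, bin_size):
    """Accumulate 3D votes on a grid and return the winning bin's members.

    votes: list of (x, y, z) object-center predictions, one per correspondence.
    bin_size: assumed grid cell size in metres (illustrative).
    Returns (indices of correspondences in the fullest bin, their mean position).
    """
    bins = defaultdict(list)
    for idx, vote in enumerate(votes):
        key = tuple(math.floor(c / bin_size) for c in vote)
        bins[key].append(idx)
    inliers = max(bins.values(), key=len)
    center = tuple(
        sum(votes[i][axis] for i in inliers) / len(inliers) for axis in range(3)
    )
    return inliers, center

# Hypothetical votes: three consistent correspondences and two mismatches.
votes = [
    (0.52, 0.31, 0.82),
    (0.53, 0.32, 0.81),
    (0.51, 0.33, 0.83),
    (0.10, 0.90, 0.40),
    (0.95, 0.05, 0.20),
]
inliers, center = hough_vote(votes, bin_size=0.1)
```

The three consistent votes share a bin and are returned as inliers, while the two mismatches are filtered out. A fixed grid can split a true peak across neighbouring bins, so practical implementations often smooth or interpolate the accumulator.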
- Research Article
64
- 10.1016/j.rcim.2020.102086
- Nov 8, 2020
- Robotics and Computer-Integrated Manufacturing
Semantic part segmentation method based 3D object pose estimation with RGB-D images for bin-picking
- Research Article
- 10.4036/iis.2025.a.05
- Jan 1, 2025
- Interdisciplinary Information Sciences
Bin-picking is the problem of automatically picking up an object from a randomly stacked pile. For scenes with complex light reflection, the Light Transport Matrix (LTM) estimation-based 3D measurement method achieves high accuracy and robustness; however, it is computationally expensive. To make bin-picking feasible as a real-time application in complex light reflection scenes, we propose a new learning-based 3D object recognition and pose estimation method. We leverage a neural network that learns features of point clouds in order to detect the object and estimate its 3D position. We develop a deep learning model that is trained on synthetic point cloud data. The key idea of our method is to separate translation estimation from rotation estimation and to introduce an attention mechanism that aggregates the pair-wise and point-wise features. We train the network on a dataset generated in simulation and test the trained network on real scenes. We also integrate the LTM estimation-based 3D measurement and the proposed object detection and pose estimation with a robot system to accomplish the bin-picking task.
- Research Article
24
- 10.1109/tip.2020.3025447
- Jan 1, 2020
- IEEE Transactions on Image Processing
Synthetic 3D object models have been proven crucial in object pose estimation, as they are utilized to generate a huge number of accurately annotated data. The object pose estimation problem is usually solved for images originating from the real data domain by employing synthetic images for training data enrichment, without fully exploiting the fact that synthetic and real images may have different data distributions. In this work, we argue that 3D object pose estimation problem is easier to solve for images originating from the synthetic domain, rather than the real data domain. To this end, we propose a 3D object pose estimation framework consisting of a two-step process, where a novel pose-oriented image-to-image translation step is first employed to translate noisy real images to clean synthetic ones and then, a 3D object pose estimation method is applied on the translated synthetic images to finally predict the 3D object poses. A novel pose-oriented objective function is employed for training the image-to-image translation network, which enforces that pose-related object image characteristics are preserved in the translated images. As a result, the pose estimation network does not require real data for training purposes. Experimental evaluation has shown that the proposed framework greatly improves the 3D object pose estimation performance, when compared to state-of-the-art methods.
- Research Article
6
- 10.3390/s21041299
- Feb 11, 2021
- Sensors (Basel, Switzerland)
Deep learning has achieved great success on robotic vision tasks. However, when compared with other vision-based tasks, it is difficult to collect a representative and sufficiently large training set for six-dimensional (6D) object pose estimation, due to the inherent difficulty of data collection. In this paper, we propose the RobotP dataset consisting of commonly used objects for benchmarking in 6D object pose estimation. To create the dataset, we apply a 3D reconstruction pipeline to produce high-quality depth images, ground truth poses, and 3D models for well-selected objects. Subsequently, based on the generated data, we produce object segmentation masks and two-dimensional (2D) bounding boxes automatically. To further enrich the data, we synthesize a large number of photo-realistic color-and-depth image pairs with ground truth 6D poses. Our dataset is freely distributed to research groups by the Shape Retrieval Challenge benchmark on 6D pose estimation. Based on our benchmark, different learning-based approaches are trained and tested by the unified dataset. The evaluation results indicate that there is considerable room for improvement in 6D object pose estimation, particularly for objects with dark colors, and photo-realistic images are helpful in increasing the performance of pose estimation algorithms.
- Research Article
26
- 10.1016/j.imavis.2025.105437
- Mar 1, 2025
- Image and Vision Computing
3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. Accurate and efficient methods are required for several real-world applications, including animation, human–robot interaction, surveillance systems, and sports, among many others. However, several obstacles, such as occlusions, random camera perspectives, and the scarcity of 3D labelled data, have been hampering the models’ performance and limiting their deployment in real-world scenarios. The higher availability of cameras has led researchers to explore multi-view solutions, which can exploit different perspectives to reconstruct the pose. Most existing reviews focus mainly on monocular 3D human pose estimation, and a comprehensive survey dedicated to multi-view approaches for determining the 3D pose has been missing since 2012. Thus, the goal of this survey is to fill that gap: to present an overview of the methodologies related to 3D pose estimation in multi-view settings, understand the strategies used to address the various challenges, and identify their limitations. According to the reviewed articles, most methods are fully-supervised approaches based on geometric constraints. Nonetheless, most methods suffer from 2D pose mismatches; incorporating temporal consistency and depth information has been suggested to reduce the impact of this limitation, while working directly with 3D features can avoid the problem entirely, at the expense of higher computational complexity. Models with lower supervision levels were identified as a way to overcome some of the issues related to 3D pose, particularly the scarcity of labelled datasets. Therefore, no method is yet capable of solving all the challenges associated with the reconstruction of the 3D pose.
Due to the existing trade-off between complexity and performance, the best method depends on the application scenario. Therefore, further research is still required to develop an approach capable of quickly inferring a highly accurate 3D pose with bearable computation cost. To this goal, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology to solve this task.
• First review only on multi-view, multi-modal methods to estimate 3D pose since 2012.
• Multi-view allows capturing the full body geometry, making 3D pose estimation easier.
• Real-world applications include sports, broadcasting, rehabilitation or animation.
• Finding a fast, accurate method with low computational cost remains a challenge.
• Multi-modal methods or view selection can lead to an efficient and effective model.
- Conference Article
144
- 10.1109/iccv.2011.6126342
- Nov 1, 2011
This paper addresses view-invariant object detection and pose estimation from a single image. While recent work focuses on object-centered representations of point-based object features, we revisit the viewer-centered framework, and use image contours as basic features. Given training examples of arbitrary views of an object, we learn a sparse object model in terms of a few view-dependent shape templates. The shape templates are jointly used for detecting object occurrences and estimating their 3D poses in a new image. Instrumental to this is our new mid-level feature, called bag of boundaries (BOB), aimed at lifting from individual edges toward their more informative summaries for identifying object boundaries amidst the background clutter. In inference, BOBs are placed on deformable grids both in the image and the shape templates, and then matched. This is formulated as a convex optimization problem that accommodates invariance to non-rigid, locally affine shape deformations. Evaluation on benchmark datasets demonstrates our competitive results relative to the state of the art.
- Conference Article
5
- 10.23919/mva.2017.7986888
- May 1, 2017
With the recent development of industrial automation, some applications have been improved with computer vision techniques. One important task is to recognize an object in a scene and estimate its 3D pose. In this work, we use a depth camera to capture the 3D information of a scene and propose a 3D pose estimation algorithm. A main difficulty of 3D object recognition and pose estimation is that the captured data may contain noise from environmental light, shadows, or the sensor. In general, the reference model and target model are captured by the same depth camera, so they have similar data structures. In our work, however, the target model is generated from Computer-Aided Design (CAD), while the reference model is captured by the depth camera. Data from different sources cause estimation errors, and this work addresses that problem. Finally, we develop a simulation system for our proposed method and simulate a manipulator performing a pick-and-place task.
- Research Article
15
- 10.1007/s11370-023-00468-4
- Jun 20, 2023
- Intelligent Service Robotics
Object recognition and pose estimation are two key functionalities for robots to safely interact with humans as well as environments. Although both use visual input, most state-of-the-art methods tackle them as two separate problems, since object recognition needs a view-invariant representation while object pose estimation necessitates a view-dependent description. Nowadays, multi-view convolutional neural network (MVCNN) approaches show state-of-the-art classification performance. Although MVCNN object recognition has been widely explored, there has been very little research on multi-view object pose estimation methods, and even less on addressing these two problems simultaneously. The poses of the virtual cameras in MVCNN methods are often predefined, which limits the applicability of such approaches. In this paper, we propose an approach capable of handling object recognition and pose estimation simultaneously. In particular, we develop a deep object-agnostic entropy estimation model capable of predicting the best viewpoints of a given 3D object. The obtained views of the object are then fed to the network to simultaneously predict the pose and category label of the target object. Experimental results show that the views obtained from such positions are descriptive enough to achieve a good accuracy score. Furthermore, we designed a real-life drink-serving scenario to demonstrate how well the proposed approach works in real robot tasks. Code is available online at: https://github.com/SubhadityaMukherjee/more_mvcnn.
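The notion of ranking viewpoints by entropy can be sketched with a direct Shannon entropy computation. The paper predicts entropy with a learned model; the example below instead computes it from hypothetical quantized depth values per view, purely to illustrate why a higher-entropy view is considered more informative.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical quantized depth values observed from three candidate viewpoints.
views = {
    "front": [0, 0, 0, 0, 1, 1, 1, 1],
    "top": [0, 0, 0, 0, 0, 0, 0, 0],      # flat view, carries no variation
    "oblique": [0, 0, 1, 1, 2, 2, 3, 3],  # most varied view
}
best_view = max(views, key=lambda name: shannon_entropy(views[name]))
```

The oblique view has the highest entropy and is selected. In this toy setting, a uniform view scores zero bits, matching the intuition that it reveals little about the object's shape.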