This paper addresses the problem of recognizing multiple objects and multiple instances from point clouds. Whereas existing methods rely on either descriptors on 3D fields or pointwise voting, our framework combines descriptor-based and voting-based schemes to achieve more robust and efficient prediction. Specifically, we propose a novel and robust descriptor, the orientation-enhanced fast point feature histogram (OE-FPFH), which describes points in both the object model and the scene and is used to build the correspondence set. The OE-FPFH integrates an orientation vector obtained by mining the geometric tensor of the local structure around a surface point, making it more representative than the original FPFH descriptor. To improve voting efficiency, we devise a novel single-point voting mechanism (SPVM), which uses the orientation vector to construct a unique local reference frame (LRF) on a single point. The SPVM takes the correspondence set as input and generates one pose candidate per correspondence by matching the LRFs of the two corresponding points. All pose candidates are subsequently divided into clusters and aggregated using the K-means clustering algorithm to estimate the poses of the different objects or instances in the scene. Experiments on three challenging datasets demonstrate that our method is effective, efficient, and robust to occlusions and multiple instances.
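To make the single-point voting step concrete, the following minimal sketch shows how one correspondence could yield one pose candidate once an LRF is available at each of the two points. It is an illustration under stated assumptions, not the paper's implementation: the function name `pose_from_lrfs`, the helper `rotation_z`, and the convention that an LRF is a 3x3 rotation matrix whose columns are the frame axes are all hypothetical, and the construction of the LRF from the OE-FPFH orientation vector is not shown.

```python
import numpy as np


def pose_from_lrfs(p_model, lrf_model, p_scene, lrf_scene):
    """Return (R, t) such that x_scene = R @ x_model + t.

    Assumed convention (not taken from the paper): each LRF is a 3x3
    rotation matrix whose columns are the frame axes, expressed in the
    model and scene coordinate systems respectively.
    """
    # The rotation that carries the model frame axes onto the scene frame axes.
    R = lrf_scene @ lrf_model.T
    # The translation that then maps the model point onto the scene point.
    t = p_scene - R @ p_model
    return R, t


def rotation_z(angle):
    """Rotation by `angle` radians about the z-axis (helper for the demo)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Synthetic correspondence: place the model in the scene with a known pose.
R_true = rotation_z(0.7)
t_true = np.array([0.5, -1.0, 2.0])

p_model = np.array([1.0, 2.0, 3.0])
lrf_model = rotation_z(-0.3)  # arbitrary model-side frame

p_scene = R_true @ p_model + t_true
lrf_scene = R_true @ lrf_model  # the frame rotates rigidly with the object

# A single correspondence is enough to recover the full 6-DoF pose.
R_est, t_est = pose_from_lrfs(p_model, lrf_model, p_scene, lrf_scene)
```

Because each correspondence produces a complete rigid transform on its own, the candidates can be collected and clustered directly, which is the role the abstract assigns to K-means.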