Existing multi-view fusion methods fuse point- or proposal-level features from different views only at the final stage of the backbone. Such one-shot late fusion prevents timely correction of the spatial misalignment between features from different views, so the discriminative depth and orientation cues of oriented 3D objects in the point cloud may be filtered out. To enhance the feature capture capability of the network, we introduce a cascaded multi-3D-view fusion method (CM3DV) that learns an implicit representation of object orientation. In particular, the proposed CM3DV method incorporates the cylindrical front-view projection into a voxelised 3D bird's-eye-view representation in a cascaded manner, and vice versa. By learning a 3D-regulated instance representation, this bi-directional mutual fusion module, termed the cascaded multi-view feature fusion module, alleviates the spatial misalignment between the two views. Furthermore, to learn rotation- and shape-invariant object features, a modulated rotation head (MRH) applies a direction-guided adjustment instead of an axis-aligned structure to extract instance-consistent features. By excluding irrelevant content, the MRH yields instance-consistent features that benefit object classification and orientation regression. Extensive experiments on the KITTI dataset show that the proposed method achieves significant improvements over existing advanced methods, especially for orientation estimation.
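The abstract does not include code, so the following PyTorch sketch illustrates one plausible reading of a single bi-directional fusion stage: features are exchanged between the bird's-eye-view (BEV) and cylindrical front-view (FV) grids through the 3D points both views share, then refined, so misalignment can be corrected at every cascaded stage rather than once at the end. All names (`BiViewFusionStage`, the `gather`/`scatter` helpers, and the precomputed index tensors `bev_idx`/`fv_idx`) are our assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of one cascaded bi-directional view-fusion stage.
# Assumption: each LiDAR point has a precomputed flat index into the
# BEV grid (bev_idx) and into the FV grid (fv_idx), both long tensors.
import torch
import torch.nn as nn


class BiViewFusionStage(nn.Module):
    """Exchange features between BEV and cylindrical FV through shared
    3D points, then refine each view with a small conv block."""

    def __init__(self, c_bev, c_fv):
        super().__init__()
        self.bev_refine = nn.Sequential(
            nn.Conv2d(c_bev + c_fv, c_bev, 3, padding=1),
            nn.BatchNorm2d(c_bev), nn.ReLU())
        self.fv_refine = nn.Sequential(
            nn.Conv2d(c_fv + c_bev, c_fv, 3, padding=1),
            nn.BatchNorm2d(c_fv), nn.ReLU())

    @staticmethod
    def gather(feat, idx):
        # feat: (B, C, H, W); idx: (B, N) flat indices into H*W -> (B, C, N)
        B, C, H, W = feat.shape
        flat = feat.flatten(2)  # (B, C, H*W)
        return flat.gather(2, idx.unsqueeze(1).expand(B, C, idx.shape[1]))

    @staticmethod
    def scatter(points, idx, H, W):
        # points: (B, C, N) -> mean-scatter onto a (B, C, H, W) grid
        B, C, N = points.shape
        grid = points.new_zeros(B, C, H * W)
        cnt = points.new_zeros(B, 1, H * W)
        grid.scatter_add_(2, idx.unsqueeze(1).expand(B, C, N), points)
        cnt.scatter_add_(2, idx.unsqueeze(1), torch.ones_like(points[:, :1]))
        return (grid / cnt.clamp(min=1)).view(B, C, H, W)

    def forward(self, bev, fv, bev_idx, fv_idx):
        # Route FV features onto the BEV grid (and vice versa) via the
        # points the two views share, then fuse by concatenation + conv.
        fv_pts = self.gather(fv, fv_idx)      # (B, c_fv, N)
        bev_pts = self.gather(bev, bev_idx)   # (B, c_bev, N)
        fv_on_bev = self.scatter(fv_pts, bev_idx, *bev.shape[2:])
        bev_on_fv = self.scatter(bev_pts, fv_idx, *fv.shape[2:])
        bev = self.bev_refine(torch.cat([bev, fv_on_bev], dim=1))
        fv = self.fv_refine(torch.cat([fv, bev_on_fv], dim=1))
        return bev, fv
```

Stacking several such stages, each consuming the previous stage's refined BEV and FV maps, would realize the cascaded structure the abstract describes; a one-stage version collapses to the conventional end-stage fusion being criticized.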
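For the modulated rotation head, the abstract only states that a direction-guided adjustment replaces an axis-aligned structure. One common way to realize such an adjustment, sketched below under that assumption, is to rotate the feature-sampling grid by the predicted yaw so pooled instance features are expressed in the object's own frame; `direction_guided_pool` and its box convention are hypothetical, and the sketch assumes one instance per feature map for brevity.

```python
# Hedged sketch of direction-guided instance feature extraction:
# a rotated crop in which the sampling grid is rotated by the predicted
# yaw, yielding features aligned to the object rather than to the axes.
import torch
import torch.nn.functional as F


def direction_guided_pool(feat, boxes, yaw, out_size=7):
    """feat: (B, C, H, W); boxes: (B, 4) with normalized cx, cy, w, h
    in [-1, 1]; yaw: (B,) predicted orientation in radians.
    Returns (B, C, out_size, out_size) object-frame features."""
    cx, cy, w, h = boxes.unbind(1)
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    # 2x3 affine that scales to the box extent, rotates by yaw, and
    # translates to the box center (output->input mapping of affine_grid).
    theta = torch.stack([
        torch.stack([cos * w / 2, -sin * h / 2, cx], dim=1),
        torch.stack([sin * w / 2,  cos * h / 2, cy], dim=1),
    ], dim=1)  # (B, 2, 3)
    grid = F.affine_grid(
        theta, (feat.size(0), feat.size(1), out_size, out_size),
        align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)
```

Under this reading, the rotated sampling window excludes background that an axis-aligned box would enclose around a tilted object, which is one concrete way the head could exclude irrelevant content before classification and orientation regression.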