LiDAR-based 3D object detectors are widely used in autonomous driving and robotic systems. Efficient voxel-based models must downsample their feature maps to reduce computation, which discards fine geometric information and limits detection accuracy. To address this problem, this paper presents PVB-SSD, a point-voxel and bird's-eye-view representation aggregation network for single-stage 3D object detection, in which a positional-information input branch generates Fourier embedding features from the original point cloud to compensate for the lost geometry. A global-former module integrates these Fourier embedding features with bird's-eye-view features extracted by a 3D convolutional backbone. Because spatial-level features are progressively replaced by semantic-level features in the deeper layers of a network, a window-transformer spatial-semantic aggregation module fuses the two dynamically. Extensive experiments on the KITTI, Waymo, and nuScenes datasets show that our model achieves strong accuracy with relatively low computational cost.
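To illustrate the positional branch described above, the following is a minimal sketch, not the paper's implementation, of a NeRF-style Fourier feature embedding applied to raw point coordinates in PyTorch; the number of frequency bands, the fixed power-of-two frequencies, and the learnable output projection are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FourierEmbedding(nn.Module):
    """Map raw (x, y, z) point coordinates to high-frequency Fourier features.

    Hypothetical sketch of a Fourier positional branch: the band count and
    the linear projection are assumptions, not the authors' exact design.
    """

    def __init__(self, in_dim: int = 3, num_bands: int = 8, out_dim: int = 64):
        super().__init__()
        # Fixed frequencies 2^0 ... 2^(num_bands - 1), not learned.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_bands))
        # One sin and one cos per band per input coordinate, then project.
        self.proj = nn.Linear(in_dim * num_bands * 2, out_dim)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) raw point coordinates.
        angles = xyz.unsqueeze(-1) * self.freqs                   # (N, 3, num_bands)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (N, 3, 2 * num_bands)
        return self.proj(feats.flatten(start_dim=1))             # (N, out_dim)

# Usage: embed 1024 random points.
points = torch.rand(1024, 3)
emb = FourierEmbedding()(points)
print(emb.shape)  # torch.Size([1024, 64])
```

Mapping coordinates through multiple sinusoidal frequencies lets downstream layers represent fine positional variation that plain (x, y, z) inputs, once voxelized and downsampled, can no longer express.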