Abstract Vehicle recognition technology is widely applied in automatic parking, traffic restrictions, and public security investigations, and it plays a significant role in the construction of intelligent transportation systems. Fine-grained vehicle recognition goes beyond conventional vehicle recognition by distinguishing more detailed sub-categories. This task is more challenging because inter-class differences are subtle and intra-class variations are large. Localization-classification subnetworks are an effective and frequently used approach to this task, but previous research has typically relied on deep CNN feature maps for object localization, whose low resolution leads to poor localization accuracy. We propose a multi-layer feature fusion localization method that fuses the high-resolution feature map from a shallow CNN layer with the deep feature map, making full use of the rich spatial information in the shallow feature map to achieve more precise object localization. In addition, traditional methods acquire local attention information through complex model designs, which frequently results in regional redundancy or information omission. To address this, we introduce an attention module that adaptively enhances the expressiveness of global features and generates global attention features. These global attention features are then integrated with object-level features and local attention cues to achieve more comprehensive attention enhancement. Lastly, we devise a multi-branch model that applies the above object localization and attention enhancement methods and is trained end to end, so that the branches collaborate to extract fine-grained features adequately.
Extensive experiments on the Stanford Cars dataset and the self-built Cars-126 dataset demonstrate the effectiveness of our method, which achieves 97.7% classification accuracy on the Stanford Cars dataset, a leading result among existing methods.
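
As a rough illustration of the idea behind multi-layer feature fusion localization, the following minimal sketch combines a low-resolution deep activation map with a high-resolution shallow one and reads a bounding box off the fused map. It is not the implementation used in this work: the function names, the channel-sum activation, the element-wise product used for fusion, and the mean-value threshold are all assumptions chosen for brevity, and the shallow map is assumed to be square-compatible with the deep map by an integer upsampling factor.

```python
import numpy as np


def activation_map(features):
    """Sum a (C, H, W) feature map over channels and scale it to [0, 1]."""
    act = features.sum(axis=0)
    act = act - act.min()
    peak = act.max()
    return act / peak if peak > 0 else act


def upsample_nearest(act, factor):
    """Nearest-neighbour upsampling of a 2-D map by an integer factor."""
    return np.repeat(np.repeat(act, factor, axis=0), factor, axis=1)


def fused_bounding_box(shallow, deep):
    """Return (row_min, col_min, row_max, col_max) on the shallow grid.

    Assumption: the shallow spatial size is an integer multiple of the
    deep spatial size, with the same factor along both axes.
    """
    factor = shallow.shape[1] // deep.shape[1]
    deep_act = upsample_nearest(activation_map(deep), factor)
    shallow_act = activation_map(shallow)

    # Hypothetical fusion rule: keep positions where both maps respond.
    fused = deep_act * shallow_act

    # Hypothetical threshold: keep positions above the mean response.
    rows, cols = np.nonzero(fused > fused.mean())
    if rows.size == 0:
        # No response at all: fall back to the whole image.
        height, width = fused.shape
        return (0, 0, height - 1, width - 1)
    return (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))
```

In this sketch the deep map decides roughly where the object is, while the shallow map trims the box at the finer resolution. A learned fusion or a different threshold would fit the same outline.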