Empirical Research on Quantization for 3D Multi-Modal ViT Models
Model quantization has proven successful at simplifying model inference in practical applications. However, existing work predominantly targets CNNs and 2D ViT models, with limited attention given to quantizing 3D models. We systematically explore the challenges of 3D model quantization and find that the numerical distributions of Softmax and LayerNorm in 3D models closely resemble those in 2D models. Consequently, we apply FQ-ViT and I-ViT, quantization algorithms designed for 2D ViT models, to 3D model quantization to address the performance degradation caused by the uneven numerical distributions in Softmax and LayerNorm. Our research includes extensive experiments on transformer architectures and establishes benchmarks, demonstrating successful quantization of the 3D multi-modal model UNITR. Notably, our quantized models incur only a slight accuracy drop relative to FP32 while outperforming other state-of-the-art models. For example, on the 3D object detection task of the nuScenes dataset, the 8-bit UNITR (FQ-ViT) achieves NDS and mAP scores of $73.0\%$ and $70.0\%$, surpassing the full-precision BEVFusion model.
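To make the Softmax issue concrete, the sketch below illustrates the log-domain (power-of-two) quantization idea used by FQ-ViT for attention probabilities: because softmax outputs are heavily skewed toward zero, a uniform grid wastes levels, whereas a log2 grid allocates resolution where the mass is. This is a minimal illustration of the general technique, not the exact kernel from the paper; the function names and the 4-bit setting are our own assumptions.

```python
import numpy as np

def log2_quantize_softmax(attn: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Quantize softmax outputs (values in (0, 1]) onto a log2 grid.

    Each probability p is mapped to the integer q closest to -log2(p),
    so the dequantized value 2**(-q) is the nearest power of two.
    Illustrative sketch of the FQ-ViT idea; details differ in the paper.
    """
    qmax = 2 ** n_bits - 1
    # Floor tiny probabilities so log2 stays finite, then round and clip.
    q = np.round(-np.log2(np.maximum(attn, 2.0 ** -qmax)))
    return np.clip(q, 0, qmax).astype(np.int32)

def log2_dequantize(q: np.ndarray) -> np.ndarray:
    """Recover the power-of-two approximation of the probabilities."""
    return 2.0 ** (-q.astype(np.float64))

# Usage: quantize the attention row for a toy set of logits.
logits = np.array([2.0, 1.0, 0.5, -1.0])
p = np.exp(logits - logits.max())
p /= p.sum()
q = log2_quantize_softmax(p)          # small non-negative integer codes
p_hat = log2_dequantize(q)            # power-of-two approximation of p
```

Because rounding happens in the log domain, every unclipped value is recovered within a factor of $2^{\pm 0.5}$ of the original, i.e. the *relative* error is bounded, which is the property that matters for skewed attention distributions.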