Arbitrary bit-width network quantization has received significant attention because it adapts to varying bit-width requirements at runtime. However, we investigate existing methods and observe a significant accumulation of quantization errors caused by switching weight and activation bit-widths, which limits performance. To address this issue, we propose MBQuant, a novel method that utilizes a multi-branch topology for arbitrary bit-width quantization. MBQuant duplicates the network body into multiple independent branches, where the weights of each branch are quantized to a fixed 2-bit width and the activations remain at the input bit-width. To complete the computation for a desired bit-width, MBQuant selects multiple branches for forward propagation such that their combined computational cost matches that of the desired bit-width. By fixing the weight bit-width, MBQuant substantially reduces the quantization errors caused by switching weight bit-widths. Additionally, we observe that the first branch suffers from quantization errors caused by all bit-widths, leading to performance degradation. Thus, we introduce an amortization branch selection strategy that amortizes these errors. Specifically, the first branch is selected only for certain bit-widths rather than universally, so that the errors are distributed more evenly among the branches. Finally, we adopt an in-place distillation strategy that uses the largest bit-width to guide the other bit-widths, further enhancing MBQuant’s performance. Extensive experiments demonstrate that MBQuant achieves significant performance gains compared to existing arbitrary bit-width quantization methods. Code is publicly available at https://github.com/zysxmu/MBQuant.
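The branch selection idea can be illustrated with a minimal sketch. It is not the released MBQuant code. It assumes that a target bit-width of b is matched in cost by b/2 two-bit branches, and the round-robin rule in `amortized_selection` is a hypothetical stand-in for the amortization strategy, whose exact rule the abstract does not specify. All function names are illustrative.

```python
from typing import Callable, List, Sequence

BRANCH_BITS = 2  # each branch holds 2-bit weights, as stated in the abstract


def num_branches(target_bits: int) -> int:
    """Assumption: a b-bit budget is matched by b / 2 two-bit branches."""
    if target_bits <= 0 or target_bits % BRANCH_BITS != 0:
        raise ValueError("target_bits must be a positive multiple of 2")
    return target_bits // BRANCH_BITS


def naive_selection(target_bits: int) -> List[int]:
    """Always pick the first k branches, so branch 0 serves every bit-width."""
    return list(range(num_branches(target_bits)))


def amortized_selection(target_bits: int, total_branches: int) -> List[int]:
    """Hypothetical round-robin rule: rotate the starting branch with k so that
    branch 0 is used only for some bit-widths."""
    k = num_branches(target_bits)
    if k > total_branches:
        raise ValueError("not enough branches for this bit-width")
    start = k * (k - 1) // 2  # branches consumed by all smaller bit-widths
    return sorted((start + j) % total_branches for j in range(k))


def usage_counts(
    select: Callable[[int], List[int]],
    bit_widths: Sequence[int],
    total_branches: int,
) -> List[int]:
    """Count how many bit-widths touch each branch under a selection rule."""
    counts = [0] * total_branches
    for bits in bit_widths:
        for branch in select(bits):
            counts[branch] += 1
    return counts


if __name__ == "__main__":
    widths = [2, 4, 6, 8]
    print(usage_counts(naive_selection, widths, 4))
    # [4, 3, 2, 1]
    print(usage_counts(lambda b: amortized_selection(b, 4), widths, 4))
    # [3, 3, 2, 2]
```

Under the naive rule the first branch is shared by all four bit-widths, while the rotating rule spreads the load more evenly, which is the effect the amortization strategy aims for.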