Knowledge distillation is a model compression technique in which a complex teacher model guides the training of a simplified student model, making the weights and their distribution crucial to study. This paper investigates the weight distributions in the convolutional and fully connected layers of both teacher and student models. For convolutional layers, both the teacher and student models were found to exhibit a piecewise power-law distribution; a verification method based on this piecewise power-law distribution was proposed, and the law was confirmed. Detailed analysis of the breakpoints and power exponents reveals that the teacher model has smaller breakpoints than the student model; for weights smaller than the breakpoint, the teacher model's power exponent is lower than the student model's, whereas for weights larger than the breakpoint it is higher. Based on these findings, a new weight initialization algorithm for convolutional layers was proposed. For fully connected layers, both models exhibit a skewed distribution; a verification method based on this skewed distribution was proposed, and the law was likewise confirmed. Analysis of kurtosis and skewness indicates that the student model has higher kurtosis and skewness than the teacher model, and a new weight initialization algorithm for fully connected layers was proposed on this basis. Experimental results show that both initialization methods improve the initial and final accuracy of the student model compared with the He initialization method.
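To illustrate the piecewise power-law idea, the minimal sketch below fits a separate power exponent to the magnitudes of a convolutional layer's weights on each side of a candidate breakpoint via straight-line fits on a log-log histogram. This is only an illustration of the concept: the layer, the breakpoint value, and the function name are hypothetical, and the abstract does not specify the paper's actual verification or initialization procedure.

```python
# Minimal sketch (not the paper's exact method): estimate the two power
# exponents of a two-segment power law for |w| below and above a breakpoint.
import numpy as np
import torch
import torch.nn as nn

def piecewise_powerlaw_fit(weights: torch.Tensor, x_break: float, bins: int = 50):
    """Fit log-density = a - alpha * log|w| separately below/above x_break;
    return the two estimated power exponents (alpha_below, alpha_above)."""
    w = weights.detach().abs().flatten().cpu().numpy()
    w = w[w > 0]
    hist, edges = np.histogram(w, bins=np.geomspace(w.min(), w.max(), bins),
                               density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])      # geometric bin centers
    mask = hist > 0
    log_x, log_y = np.log(centers[mask]), np.log(hist[mask])
    lo, hi = log_x < np.log(x_break), log_x >= np.log(x_break)
    alpha_below = -np.polyfit(log_x[lo], log_y[lo], 1)[0] if lo.sum() > 1 else float("nan")
    alpha_above = -np.polyfit(log_x[hi], log_y[hi], 1)[0] if hi.sum() > 1 else float("nan")
    return alpha_below, alpha_above

# Example usage on an (untrained) stand-in layer; in practice the weights
# would come from a trained teacher or student model.
conv = nn.Conv2d(64, 128, kernel_size=3)
print(piecewise_powerlaw_fit(conv.weight, x_break=1e-2))
```

Comparing the fitted exponents and breakpoints between teacher and student layers in this way mirrors, at a high level, the kind of analysis the abstract describes.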