<p>Deep learning models have been successfully applied to many visual tasks. However, they tend to be increasingly cumbersome due to their high computational complexity and large storage requirements. How to compress convolutional neural network (CNN) models while still maintaining their accuracy has received increasing attention from the community, and knowledge distillation (KD) is an efficient way to do this. Existing KD methods have focused on selecting good teachers from multiple candidates or on choosing KD layers, which is cumbersome, computationally expensive, and requires large neural networks for the individual models. Most teacher and student modules are CNN-based networks. In addition, recently proposed KD methods have relied on the cross-entropy (CE) loss function in both the student network and the KD network. This research focuses on the quantifiable evaluation of teacher-student models, in which knowledge is distilled not only between training models that share the same CNN architecture but also between different architectures. Furthermore, we propose a combination of CE, balanced cross-entropy (BCE), and focal loss functions that not only softens the value of the loss function when transferring knowledge from a large teacher model to a small student model but also improves classification performance. The proposed solution is evaluated on four benchmark static image datasets, and the experimental results show that it outperforms state-of-the-art (SOTA) methods by 2.67% to 9.84% in top-1 accuracy.</p>
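<p>The combined loss can be illustrated with a minimal PyTorch sketch. This is not the paper's exact formulation: the soft-target term uses the standard temperature-scaled KL divergence commonly used in KD, and the temperature <code>T</code>, loss weights (<code>w_kd</code>, <code>w_ce</code>, <code>w_bce</code>, <code>w_focal</code>), focal focusing parameter, and class weights are hypothetical placeholders.</p>
<pre><code># Hedged sketch: CE + balanced-CE + focal loss combined with a standard
# soft-target KD term. All weights and hyperparameters are illustrative
# assumptions, not values reported in the paper.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: (1 - p_t)^gamma * CE, averaged over the batch."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, targets, reduction="none")          # per-sample CE
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()  # true-class prob
    return ((1.0 - p_t) ** gamma * ce).mean()

def balanced_ce_loss(logits, targets, class_weights):
    """Cross entropy weighted per class (e.g., by inverse class frequency)."""
    return F.cross_entropy(logits, targets, weight=class_weights)

def combined_kd_loss(student_logits, teacher_logits, targets, class_weights,
                     T=4.0, w_kd=0.5, w_ce=0.2, w_bce=0.15, w_focal=0.15):
    """Weighted sum of a temperature-scaled KD term and the
    CE + balanced-CE + focal losses on the hard labels."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    bce = balanced_ce_loss(student_logits, targets, class_weights)
    fl = focal_loss(student_logits, targets)
    return w_kd * kd + w_ce * ce + w_bce * bce + w_focal * fl
</code></pre>
<p>In this sketch the teacher's logits are assumed to be precomputed (the teacher is frozen during distillation), and the class-weight vector for the balanced CE term would be derived from the training-set label distribution.</p>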