Artificial intelligence at the edge can help solve complex tasks faced by sectors such as automotive, healthcare, and surveillance. However, constrained by the limited computational power of edge devices, artificial intelligence models must be adapted. Many model compression approaches have been developed and quantified over the years to tackle this problem, yet few studies have considered the overhead of on-device model compression, even though compression itself can take a considerable amount of time. By adding this metric, we provide a more complete view of the efficiency of model compression at the edge. The objective of this research is to identify the benefit of compression methods and the trade-off between size and latency reduction versus accuracy loss and compression time on edge devices. In this work, a quantitative method is used to analyze and rank three common model compression techniques: post-training quantization, unstructured pruning, and knowledge distillation, on the basis of accuracy, latency, model size, and time-to-compress overhead. We conclude that knowledge distillation performs best, offering up to 11.4x model size reduction and a 78.67% latency speedup, with moderate accuracy loss and compression time.