Abstract

Visual data captured by smart terminal devices increasingly serves machine analysis, e.g., image classification and object detection, and only occasionally human viewing in special scenarios, which calls for a new compression framework. In this work, we propose an image coding framework, termed STSIC, that serves machine analysis and human viewing simultaneously. To leverage the relationship between the two tasks, STSIC is built on a scalable structure comprising a base layer for machine analysis and low-quality image preview, and an enhancement layer for high-quality image reconstruction. Specifically, features extracted by the backbone are transmitted to the cloud side, where the base-layer tasks are performed. Unlike the vast majority of existing works, we employ a Swin-Transformer as the backbone instead of a convolutional neural network (CNN). Moreover, exploiting the hierarchical structure of the Swin-Transformer, a feature fusion strategy generates a common feature that is friendly to both human and machine tasks. The enhancement layer transmits the residuals between the original images and the base-layer reconstructions, thereby improving reconstruction quality for human viewing. Experimental results show that STSIC outperforms relevant benchmarks on object detection, especially at low bitrates. For image reconstruction the advantage narrows, but STSIC still surpasses HEVC in rate-distortion performance.
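To make the scalable structure concrete, the following is a minimal PyTorch sketch of the two-layer idea described above: a base layer that fuses hierarchical features into a common representation and decodes a low-quality preview, and an enhancement layer that codes the residual for high-quality reconstruction. Simple convolutions stand in for the Swin-Transformer backbone, entropy coding and the task head are omitted, and all module and variable names are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseLayer(nn.Module):
    """Extracts hierarchical features, fuses them into a common feature
    (for both the machine-analysis task and preview decoding), and
    reconstructs a low-quality preview for human viewing."""
    def __init__(self, ch=64):
        super().__init__()
        # Convolutional stand-ins for the hierarchical Swin stages.
        self.stage1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)   # 1/2 res
        self.stage2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # 1/4 res
        # Feature fusion: merge shallow and (upsampled) deep features
        # into one common feature usable by both tasks.
        self.fuse = nn.Conv2d(2 * ch, ch, 1)
        # Lightweight decoder producing the low-quality preview.
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x):
        f1 = F.relu(self.stage1(x))
        f2 = F.relu(self.stage2(f1))
        f2_up = F.interpolate(f2, size=f1.shape[-2:], mode="nearest")
        common = self.fuse(torch.cat([f1, f2_up], dim=1))
        preview = self.decode(common)   # low-quality reconstruction
        return common, preview          # `common` would feed the task head

class EnhancementLayer(nn.Module):
    """Codes the residual between the original image and the base-layer
    preview, enabling high-quality reconstruction for human viewing."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc = nn.Conv2d(3, ch, 3, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1)

    def forward(self, x, preview):
        residual = x - preview
        rec_residual = self.dec(F.relu(self.enc(residual)))
        return preview + rec_residual   # high-quality reconstruction

x = torch.randn(1, 3, 128, 128)
base, enh = BaseLayer(), EnhancementLayer()
common_feature, preview = base(x)
high_quality = enh(x, preview)
print(common_feature.shape, preview.shape, high_quality.shape)
```

In this sketch the enhancement layer depends only on the base-layer output, so a decoder that needs just the machine-analysis result (or the preview) can stop after the base layer, which is what makes the bitstream scalable.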
