End-to-End Learned Scalable Multilayer Feature Compression For Machine Vision Tasks
In the field of Video Coding for Machines (VCM), scalable feature compression has attracted attention for its potential to support a variety of machine vision tasks. However, the existing scalable feature compression methods exhibit limited performance. To address this problem, we propose an end-to-end learned scalable multilayer feature compression method in this paper. First, we propose to leverage an end-to-end feature compression method, which can efficiently exploit redundancy among features through a learning approach, to improve compression efficiency. Second, we introduce a novel strategy involving the use of the transformed latent of the base layer as the conditional information for the enhancement layer. Given the learnable nature of our compression method, we propose to optimize the base layer and the enhancement layer jointly. The joint optimization encourages the base layer to produce more suitable conditional information for the enhancement layer. Comparative experiments against existing feature compression and image compression methods verify our approach’s remarkable performance improvements.
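The joint base/enhancement optimization described in this abstract can be illustrated with a toy rate-distortion objective. This is a hedged sketch, not the paper's actual model: the histogram-based rate proxy, the squared-error distortion inputs, and the names `layer_rate` and `joint_loss` are all illustrative assumptions.

```python
import numpy as np

def layer_rate(latent: np.ndarray) -> float:
    """Toy rate proxy: empirical entropy (in bits) of the quantized
    latent -- a stand-in for a learned entropy model."""
    q = np.round(latent).astype(int)
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() * q.size)

def joint_loss(base_latent, enh_residual, base_dist, enh_dist, lam=0.01):
    """Joint rate-distortion objective over both layers.

    In conditional coding the enhancement layer only encodes what the
    base-layer latent cannot predict, so a base layer that provides
    good conditioning shrinks the enhancement residual and its rate.
    """
    rate = layer_rate(base_latent) + layer_rate(enh_residual)
    dist = base_dist + enh_dist
    return rate + lam * dist
```

Optimizing both layers against this single objective is what rewards the base layer for producing conditional information useful to the enhancement layer.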
- Research Article
81
- 10.1109/tmm.2021.3068580
- Jan 1, 2021
- IEEE Transactions on Multimedia
The past decades have witnessed the rapid development of image and video coding techniques in the era of big data. However, the signal fidelity-driven coding pipeline design limits the capability of the existing image/video coding frameworks to fulfill the needs of both machine and human vision. In this paper, we come up with a novel face image coding framework by leveraging both the compressive and the generative models, to support machine vision and human perception tasks jointly. Given an input image, the feature analysis is first applied, and then the generative model is employed to reconstruct the image with compact structure and color features, where sparse edges are extracted to connect both kinds of vision and a key reference pixel selection method is proposed to determine the priorities of the reference color pixels for scalable coding. The compact edge map serves as the basic layer for machine vision tasks, and the reference pixels act as an enhancement layer to guarantee signal fidelity for human vision. By introducing advanced generative models, we train a decoding network to reconstruct images from compact structure and color representations, which is flexible enough to accept inputs in a scalable way and to control the imagery effect of the outputs between signal fidelity and visual realism. Experimental results and comprehensive performance analysis over the face image dataset demonstrate the superiority of our framework in both human vision tasks and machine vision tasks, which provides useful evidence for the emerging standardization efforts on MPEG VCM (Video Coding for Machines).
- Research Article
44
- 10.1109/tpami.2024.3367293
- Jul 1, 2024
- IEEE transactions on pattern analysis and machine intelligence
As an emerging research practice leveraging recent advanced AI techniques, e.g. deep model based prediction and generation, Video Coding for Machines (VCM) is committed to bridging the largely separate research tracks of video/image compression and feature compression, and attempts to optimize compactness and efficiency jointly from a unified perspective of high-accuracy machine vision and full-fidelity human vision. With the rapid advances of deep feature representation and visual data compression in mind, in this paper we summarize VCM methodology and philosophy based on existing academic and industrial efforts. The development of VCM follows a general rate-distortion optimization, and a categorization of key modules or techniques is established, including feature-assisted coding, scalable coding, intermediate feature compression/optimization, and machine vision targeted codecs, from the broader perspectives of vision tasks, analytics resources, etc. Previous works demonstrate that, although existing works attempt to reveal the nature of scalable representation in bits when dealing with machine and human vision tasks, studies on the generality of low bit-rate representation, and accordingly on how to support a variety of visual analytic tasks, remain rare. Therefore, we investigate a novel visual information compression approach for the analytics taxonomy problem to strengthen the capability of compact visual representations extracted from multiple tasks for visual analytics. A new perspective of task relationships versus compression is revisited. Keeping in mind the transferability among different machine vision tasks (e.g. high-level semantic and mid-level geometry-related tasks), we aim to support multiple tasks jointly at low bit rates. In particular, to narrow the dimensionality gap between neural network generated features extracted from pixels and a variety of machine vision features/labels (e.g. scene class, segmentation labels), a codebook hyperprior is designed to compress the neural network-generated features. As demonstrated in our experiments, this new hyperprior model is expected to improve feature compression efficiency by estimating the signal entropy more accurately, which enables further investigation of the granularity of abstracting compact features among different tasks.
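The codebook hyperprior summarized above quantizes network-generated features to discrete codewords so that their entropy can be estimated more accurately. A minimal sketch of the idea, assuming a hard nearest-neighbour assignment and an empirical entropy estimate (the actual model learns the codebook and prior end-to-end; all names and values here are illustrative):

```python
import numpy as np

def quantize_to_codebook(features: np.ndarray, codebook: np.ndarray):
    """Map each feature vector to its nearest codeword (hard assignment).

    Returns the integer indices (what would actually be entropy-coded)
    and the dequantized features seen by the downstream task network.
    """
    # Pairwise squared distances between features (N, D) and codewords (K, D)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def index_entropy_bits(idx: np.ndarray, num_codes: int) -> float:
    """Empirical entropy of the code indices, in bits -- a proxy for the
    bitrate an entropy model would need to transmit them."""
    counts = np.bincount(idx, minlength=num_codes)
    p = counts[counts > 0] / idx.size
    return float(-(p * np.log2(p)).sum() * idx.size)
```

Because the indices live in a small discrete alphabet, their distribution is far easier to model than that of continuous high-dimensional features, which is the efficiency gain the abstract refers to.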
- Conference Article
21
- 10.1109/icpr56361.2022.9956532
- Aug 21, 2022
There is a growing number of images that are analyzed by machines rather than just humans. Most machine vision tasks are currently based on decoded images, which require an image compression (encoding/decoding) framework. However, using the decoded images in the pixel domain has two drawbacks: 1) the complexity is high for the decoder part; 2) the accuracy (e.g., mIoU, mean absolute error, and average precision) of machine vision tasks is degraded, since decoded images are optimized only for human-perceived quality (e.g., PSNR), so information required for machine vision tasks is lost during the decoding process. In this paper, we improve machine vision tasks in the compressed domain. 1) A gate module is utilized to effectively select compressed-domain features. 2) Knowledge distillation is introduced to improve the accuracy. 3) A training strategy is explored to support multiple tasks, including image compression. The experimental results show that we can achieve better rate-accuracy/distortion and lower complexity compared with the state-of-the-art pixel-domain work that supports both machine and human vision tasks.
- Research Article
13
- 10.1109/access.2023.3261668
- Jan 1, 2023
- IEEE Access
In Internet of Things (IoT) communications, visual data are frequently processed among intelligent devices using artificial intelligence algorithms, replacing humans for analysis and decision-making while only occasionally requiring human scrutiny. However, due to the high redundancy of compressive encoders, existing image coding solutions for machine vision are not efficient at runtime. To balance the rate-accuracy performance and efficiency of image compression for machine vision while attaining high-quality reconstructed images for human vision, this paper introduces a novel slimmable multi-task compression framework for human and machine vision in visual IoT applications. Firstly, image compression for human and machine vision under the constraints of bandwidth, latency, and computational resources is modelled as a multi-task optimization problem. Secondly, slimmable encoders are employed for multiple human and machine vision tasks, in which the parameters of the sub-encoder for machine vision tasks are shared among all tasks and jointly learned. Thirdly, to address the feature mismatch between the latent representation and the intermediate features of deep vision networks, feature transformation networks are introduced as decoders for machine vision feature compression. Finally, the proposed framework is successfully applied to human and machine vision task scenarios, e.g., object detection and image reconstruction. Experimental results show that the proposed method outperforms baselines and other image compression approaches on machine vision tasks with higher efficiency (shorter latency) in two vision task scenarios while retaining comparable quality on image reconstruction.
- Research Article
1
- 10.1117/1.jei.31.2.023014
- Mar 21, 2022
- Journal of Electronic Imaging
The existing image compression methods are mainly aimed at human perception tasks instead of machine vision tasks. Rich features learned by the shallow and deep layers of a pre-trained visual geometry group (VGG)-net can serve human perception and machine vision tasks, respectively. To improve the machine analysis capabilities of human-targeted compression methods, we propose a scalable image compression framework for low bit-rates. Specifically, the scalable compression framework is composed of a base layer (BL) and an enhancement layer (EL), which utilize the correlation between the above features to perform machine analysis and human perception tasks, respectively. To effectively utilize these two types of features, we propose a multi-branch shared module that exploits the complementarity of multi-branch convolution kernels to retain compact information for both types of features, supporting the BL and EL tasks. In addition, to further improve the accuracy of machine analysis, a machine-vision importance map is introduced in the BL; it adaptively utilizes the spatial and channel information from the deepest layer of VGG-net to guide local bit allocation. When the bit-rate is limited to 0.2 bpp, the average recognition accuracy (Top-1) of the BL of the proposed method is 5.2%, 13.4%, 6.0%, and 7.4% higher than that of BPG, WebP, Mentzer, and NIC, respectively, on the ILSVRC2012 validation dataset. Meanwhile, the EL provides a good visual experience.
- Conference Article
82
- 10.1109/icme46284.2020.9102750
- Jul 1, 2020
The past decades have witnessed the rapid development of image and video coding techniques in the era of big data. However, the signal fidelity-driven coding pipeline design limits the capability of the existing image/video coding frameworks to fulfill the needs of both machine and human vision. In this paper, we come up with a novel image coding framework by leveraging both the compressive and the generative models, to support machine vision and human perception tasks jointly. Given an input image, the feature analysis is first applied, and then the generative model is employed to perform image reconstruction with features and additional reference pixels, in which compact edge maps are extracted in this work to connect both kinds of vision in a scalable way. The compact edge map serves as the basic layer for machine vision tasks, and the reference pixels act as an enhancement layer to guarantee signal fidelity for human vision. By introducing advanced generative models, we train a flexible network to reconstruct images from compact feature representations and the reference pixels. Experimental results demonstrate the superiority of our framework in both human visual quality and facial landmark detection, which provides useful evidence for the emerging standardization efforts on MPEG VCM (Video Coding for Machines). Our project website is available at https://williamyang1991.github.io/projects/VCM-Face/.
- Research Article
3
- 10.1109/access.2023.3263207
- Jan 1, 2023
- IEEE Access
Image analysis based on machine vision is widely used in the smart industry. Good-quality images are required for outstanding machine analysis results, but handling high-definition images can be problematic in a constrained environment such as a low-bandwidth network or low-capacity storage. Lowering the image resolution might be a straightforward way to reduce image data, but it would cause substantial information loss, leading to the deterioration of machine vision. Moreover, human supervision could be necessary for contingencies that machine vision cannot handle. Therefore, an innovative image compression method considering both machine and human vision is required: one with higher compression efficiency than the state-of-the-art codec, strong machine vision performance, and human-recognizable quality. In this paper, we propose Versatile Video Coding (VVC) based image compression for hybrid vision, i.e., machine vision and human vision. Our work provides coding tree unit (CTU) level image compression with dual quantization parameters (QPs) according to the quantization parameter map and the saliency extracted by an object detection network: in the salient region, the proposed method maintains high quality with a low QP, while it degrades quality with a high QP in the non-salient region. Compared with VVC, the proposed compression method achieves a bitrate reduction of up to 25.5% in machine vision tasks, demonstrating higher compression efficiency with still admirable machine vision performance. From the perspective of human vision, the proposed method provides human-perceptible image quality, preserving acceptable objective quality values.
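The dual-QP assignment described above reduces to a simple per-CTU rule. A minimal sketch, assuming a thresholded saliency score per CTU (the specific QP values, the threshold, and the function name are illustrative, not the paper's configuration):

```python
def assign_ctu_qps(saliency, qp_low=27, qp_high=42, threshold=0.5):
    """Build a per-CTU QP map: salient CTUs (e.g. those overlapping
    detected objects) get the low QP (high quality), the rest get the
    high QP (coarse quantization, fewer bits).

    `saliency` is a 2-D list of per-CTU saliency scores in [0, 1],
    e.g. derived from an object-detection network's outputs.
    """
    return [[qp_low if s >= threshold else qp_high for s in row]
            for row in saliency]
```

The encoder then codes each CTU with its assigned QP, concentrating bits in the regions the machine task actually inspects.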
- Research Article
2
- 10.1145/3678471
- Oct 15, 2024
- ACM Transactions on Multimedia Computing, Communications, and Applications
Reconstruction-free image compression for machine vision aims to perform machine vision tasks directly on compressed-domain representations instead of reconstructed images. Existing reports have validated the feasibility of compressed-domain machine vision. However, we observe that when using recently learned compression models, the performance gap between compressed-domain and pixel-domain vision tasks is still large due to the lack of some natural inductive biases in pixel-domain convolutional neural networks. In this article, we attempt to address this problem by transferring knowledge from the pixel domain to the compressed domain. A knowledge transfer loss defined at both output level and feature level is proposed to narrow the gap between the compressed domain and the pixel domain. In addition, we modify neural networks for pixel-domain vision tasks to better suit compressed-domain inputs. Experimental results on several machine vision tasks show that the proposed method improves the accuracy of compressed-domain vision tasks significantly, which even outperforms learning on reconstructed images while avoiding the computational cost of image reconstruction.
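The two-level knowledge transfer loss described above combines the student's own task loss with distillation terms at the output and feature levels. A hedged sketch using mean-squared error for both levels (the distance choices, weights, and names are assumptions for illustration, not the article's exact formulation):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of matching shape."""
    return float(((np.asarray(a) - np.asarray(b)) ** 2).mean())

def transfer_loss(task_loss, student_logits, teacher_logits,
                  student_feats, teacher_feats, alpha=1.0, beta=1.0):
    """Total loss for the compressed-domain student network: its own
    task loss, plus penalties for deviating from the pixel-domain
    teacher at the output level and at the feature level."""
    output_kd = mse(student_logits, teacher_logits)
    feature_kd = mse(student_feats, teacher_feats)
    return task_loss + alpha * output_kd + beta * feature_kd
```

Minimizing both distillation terms pushes the compressed-domain network toward the inductive biases the pixel-domain teacher acquired, which is the mechanism the abstract credits for closing the accuracy gap.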
- Research Article
3
- 10.1109/tcsvt.2023.3274739
- Apr 1, 2025
- IEEE Transactions on Circuits and Systems for Video Technology
Although recent learning-based image and video coding techniques have developed rapidly, their signal fidelity-driven targets diverge from a highly effective and efficient coding framework for both human and machine vision. In this paper, we aim to address this issue by making use of the power of generative models to bridge the gap between full fidelity (for human vision) and high discrimination (for machine vision). Relying on existing pretrained generative adversarial networks (GAN), we build a GAN inversion framework that projects the image into a low-dimensional natural image manifold. In this manifold, the feature is highly discriminative and also encodes the appearance information of the image; we name it the latent code. Taking a variational bit-rate constraint with a hyperprior model to model/suppress the entropy of the image manifold code, our method is capable of fulfilling the needs of both machine and human vision at very low bit-rates. To improve the visual quality of image reconstruction, we further propose multiple latent codes and scalable inversion. The former obtains several latent codes in the inversion, while the latter additionally compresses and transmits a shallow compact feature to support visual reconstruction. Experimental results demonstrate the superiority of our method in both human vision tasks, i.e. image reconstruction, and machine vision tasks, including semantic parsing and attribute prediction.
- Conference Article
27
- 10.1109/icme46284.2020.9102843
- Jul 1, 2020
In this paper, we study a new problem arising from the emerging MPEG standardization effort Video Coding for Machines (VCM), which aims to bridge the gap between visual feature compression and classical video coding. VCM is committed to addressing the requirement of compact signal representation for both machine and human vision in a more or less scalable way. To this end, we leverage the strength of predictive and generative models to support advanced compression techniques for both machine and human vision tasks simultaneously, in which visual features serve as a bridge to connect signal-level and task-level compact representations in a scalable manner. Specifically, we employ a conditional deep generation network to reconstruct video frames with the guidance of a learned motion pattern. By learning to extract sparse motion patterns via a predictive model, the network elegantly leverages the feature representation to generate the appearance of to-be-coded frames via a generative model, relying on the appearance of the coded key frames. Meanwhile, the sparse motion pattern is compact and highly effective for high-level vision tasks, e.g. action recognition. Experimental results demonstrate that our method yields much better reconstruction quality compared with traditional video codecs (0.0063 gain in SSIM), as well as state-of-the-art action recognition performance over highly compressed videos (9.4% gain in recognition accuracy), which showcases a promising paradigm of coding signals for both human and machine vision.
- Research Article
7
- 10.1007/bf01414883
- Sep 1, 1998
- Neural Computing & Applications
A variety of computational tasks in early vision can be formulated through lattice networks. The cooperative action of these networks depends upon the topology of interconnections, both feedforward and recurrent ones. The Gabor-like impulse response of a 2nd-order lattice network (i.e. with nearest and next-to-nearest interconnections) is analysed in detail, pointing out how a near-optimal filtering behaviour in space and frequency domains can be achieved through excitatory/inhibitory interactions without impairing the stability of the system. These architectures can be mapped, very efficiently at transistor level, onto VLSI structures operating as analogue perceptual engines. The hardware implementation of early vision tasks can, indeed, be tackled by combining these perceptual agents through suitable weighted sums. Various implementation strategies have been pursued with reference to: (i) the algorithm-circuit mapping (current-mode and transconductor approaches); (ii) the degree of programmability (fixed, selectable and tunable); and (iii) the implementation technology (2 μm and 0.8 μm gate lengths). Applications of the perceptual engine to machine vision algorithms are discussed.
- Research Article
5
- 10.11591/ijra.v11i2.pp111-121
- Jun 1, 2022
- IAES International Journal of Robotics and Automation (IJRA)
Machine vision or robot vision is playing an important role in many industrial systems and has many potential applications in the future of automation tasks such as in-house robot management, swarm robotics control, product line observation, and robot grasping. One of the most common yet challenging tasks in machine vision is 3D object localization. Although several works have been introduced and achieved good results for object localization, there is still room to further improve object location determination. In this paper, we introduce a novel 3D object localization algorithm in which a checkerboard pattern-based method is used to initialize the object location, followed by a regression model to regularize the object location. The proposed object localization is employed in a low-cost robot grasping system where only one simple 2D camera is used. Experimental results showed that the proposed algorithm significantly improves the accuracy of object localization compared to relevant works.
- Book Chapter
2
- 10.1007/978-1-4612-4532-2_11
- Jan 1, 1989
Shape-based (iconic) approaches play a vital role in the early stages of a computer vision system. Many computer vision applications require only 2-D information about objects. These applications allow the use of techniques that emphasize pictorial or iconic features. In this chapter we present an iconic approach using morphological image processing as a tool for analyzing images to recover 2-D information. We also briefly discuss a special architecture that allows very fast implementation of morphological operators to recover useful information in diverse applications. We demonstrate the efficacy of this approach by presenting details of an application. We show that the iconic approach offers features that could simplify many tasks in machine vision systems.
- Conference Article
10
- 10.1109/mmsp53017.2021.9733523
- Oct 6, 2021
Auxiliary attachments change more frequently than essential clothing on human beings. Auxiliary attachments usually include coats, jumpers, hats, bags, etc. Recognizing them is one of the hardest tasks in machine vision. It becomes more difficult if a specific person reappears after a longer time period, while other influential factors include angle variation and walking speed. One of the key application areas for person verification is border control, where auxiliary attachment variation is more common; it is usually a reflection of ethnicity or fashion. In machine vision, the availability of such datasets is very limited, in particular those featuring reappearance after a longer time period, i.e., more than weeks or months. To overcome the limited dataset problem, transfer learning is a leading solution for improved verification. In this paper, we propose an aggregated deep learning model called ApparelNet, specifically for person verification in border control environments. We used the Front-View Gait (FVG) dataset to evaluate the performance of our aggregated model. FVG is a pedestrian dataset of people encompassing auxiliary attachment variation, with three different angles from the camera and three different walking speeds. Our ApparelNet acquires single-image-based detection confidence using OpenPose, and additional layers of a pre-trained EfficientNetB0 are then trained on the custom FVG dataset, including fine-tuning of the overall EfficientNetB0. EfficientNetB0 is a highly efficient and scalable transfer learning model from the family of deep CNNs. Overall, our ApparelNet reported training and validation accuracy of 98%; for the border control scenario, model verification is performed by selecting random images of 12 different individuals, and the computed prediction probability accumulates to 96%. In our opinion, the model is a strong candidate for person re-identification, where the goal is one-to-many recognition. It may also become an ancillary component of any biometrics system.
- Research Article
6
- 10.5121/sipij.2011.2312
- Sep 30, 2011
- Signal & Image Processing : An International Journal
Image processing in machine vision is a challenging task because often real-time requirements have to be met in these systems. To accelerate the processing tasks in machine vision and to reduce data transfer latencies, new architectures for embedded systems in intelligent cameras are required. Furthermore, innovative processing approaches are necessary to realize these architectures efficiently. Marching Pixels are one such processing scheme, based on Organic Computing principles, and can be applied, for example, to determine object centroids in binary or gray-scale images. In this paper, we present a processing pipeline for smart camera systems utilizing such Marching Pixel algorithms. It consists of a buffering template for image pre-processing tasks in an FPGA to enhance captured images and an ASIC for the efficient realization of Marching Pixel approaches. The ASIC achieves an eightfold speedup for the realization of Marching Pixel algorithms, compared with a common medium-performance DSP platform.