DPVOC: Dual-Prompt for Learned Variable Bitrate Omnidirectional Image Compression
This paper introduces DPVOC, a dual-prompt learned variable bitrate omnidirectional image compression framework utilizing a dual-branch architecture with CNNs and Swin Transformers. By employing distortion and quality maps as prompts, DPVOC enables region-adaptive bit allocation, achieving superior compression performance with low computational complexity, validated through experiments.
With the widespread adoption of consumer electronic devices such as virtual reality (VR) headsets, panoramic cameras, and ultra-high-definition displays, omnidirectional (360°) images have become increasingly important for providing immersive user experiences. However, the high resolution and data volume of these images pose significant challenges for bandwidth-limited and resource-constrained consumer electronics. To address these challenges, based on an advanced parallel dual-branch hybrid architecture (TCM) consisting of convolutional neural networks (CNNs) and Swin Transformer, we propose a dual-prompt learned variable bitrate omnidirectional image compression framework, termed DPVOC, which utilizes distortion maps (Dmaps) and quality maps (Qmaps) as dual prompts to enable region-adaptive bit allocation and achieve efficient variable bitrate compression. Specifically, during training, to alleviate the computational burden of processing entire ERP images, we randomly crop ERP images into patches as input to the network. Considering the varying degrees of distortion redundancy across different regions of ERP patches, we introduce corresponding Dmap patches to record the local distortion levels. In the CNN branch, the patch-wise uniform Qmaps are element-wise multiplied with the Dmaps to modulate the CNN features. In the Swin Transformer branch, the uniform Qmap patches are used as prompts in the attention mechanism to guide the feature embeddings for adaptability to bitrate variations. Additionally, Dmap patches are introduced into the feedforward network (FFN) of the Swin Transformer to suppress redundant information. By incorporating fine-grained and symmetric prompts from both Qmaps and Dmaps into the encoder and decoder through the dual-branch structure, our networks can effectively adapt to diverse bitrate requirements. During inference, entire Qmaps and Dmaps are used as inputs, and their bitrate overhead is negligible. 
Experimental results demonstrate that DPVOC achieves superior performance in omnidirectional image compression while maintaining low computational complexity.
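The CNN-branch modulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the Dmap is assumed to be the cos(latitude) row weight familiar from WS-PSNR, which is one common way to record how strongly an ERP row is oversampled. The abstract does not say how DPVOC computes its Dmaps.

```python
import math

def erp_distortion_map(height, width):
    """Hypothetical Dmap: one cos(latitude) weight per ERP row (WS-PSNR style).

    Rows near the poles are heavily oversampled in ERP, so they get small weights.
    """
    rows = []
    for i in range(height):
        latitude = (i + 0.5 - height / 2) * math.pi / height
        rows.append([math.cos(latitude)] * width)
    return rows

def modulate(features, qmap, dmap):
    """Scale each feature value by the element-wise product qmap * dmap."""
    return [
        [f * q * d for f, q, d in zip(f_row, q_row, d_row)]
        for f_row, q_row, d_row in zip(features, qmap, dmap)
    ]

H, W = 4, 2
dmap = erp_distortion_map(H, W)
qmap = [[0.5] * W for _ in range(H)]      # uniform quality level (assumed value)
features = [[1.0] * W for _ in range(H)]  # stand-in for one CNN feature channel
out = modulate(features, qmap, dmap)
```

In this toy setting the rows nearest the equator keep more of their feature magnitude than the rows nearest the poles, which is the kind of region-adaptive behaviour the prompts are meant to encourage.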
- Research Article
- 10.3126/sxcj.v1i1.70879
- Oct 18, 2024
- SXC Journal
Image identification is a challenging problem in computer vision, and deep learning methods have greatly improved the performance of image classification systems. In this research, Convolutional Neural Networks (CNN) and Feed Forward Neural Networks (FFNN) are used for image classification. A CNN extracts relevant information from images using convolutional layers and reduces the dimensionality of the derived features with pooling layers, which makes it highly effective for image classification. An FFNN is a classic neural network with fully connected layers, and it can be used to further process the features extracted by a CNN. The study trains CNN and FFNN models on a large dataset of tomato images to categorize them by type, ripeness, and damage status. The CNN is found to be more effective than the FFNN for tomato classification in all use cases. The accuracy for classifying whether an image shows a tomato using CNN is 95.83%; type classification reaches 81.52% with CNN and 66.30% with FFNN; ripeness grading reaches 92.86% with CNN and 57.14% with FFNN; and damage status grading reaches 92.86% with CNN and 67.86% with FFNN. Therefore, it can be concluded that quality processing of tomatoes can be improved using CNN.
- Conference Article
29
- 10.1109/icpr48806.2021.9412493
- Jan 10, 2021
In this paper, we study the task of facial expression recognition under strong occlusion. We are particularly interested in cases where 50 % of the face is occluded, e.g. when the subject wears a Virtual Reality (VR) headset. While previous studies show that pre-training convolutional neural networks (CNNs) on fully-visible (non-occluded) faces improves the accuracy, we propose to employ knowledge distillation to achieve further improvements. First of all, we employ the classic teacher-student training strategy, in which the teacher is a CNN trained on fully-visible faces and the student is a CNN trained on occluded faces. Second of all, we propose a new approach for knowledge distillation based on triplet loss. During training, the goal is to reduce the distance between an anchor embedding, produced by a student CNN that takes occluded faces as input, and a positive embedding (from the same class as the anchor), produced by a teacher CNN trained on fully-visible faces, so that it becomes smaller than the distance between the anchor and a negative embedding (from a different class than the anchor), produced by the student CNN. Third of all, we propose to combine the distilled embeddings obtained through the classic teacher-student strategy and our novel teacher-student strategy based on triplet loss into a single embedding vector. We conduct experiments on two benchmarks, FER+ and AffectNet, with two CNN architectures, VGG-f and VGG-face, showing that knowledge distillation can bring significant improvements over the state-of-the-art methods designed for occluded faces in the VR setting. Furthermore, we obtain accuracy rates that are quite close to the state-of-the-art models that take as input fully-visible faces. 
For example, on the FER+ data set, our VGG-face based on concatenated distilled embeddings attains an accuracy rate of 82.75% on lower-half-visible faces, which is only 2.24% below the accuracy rate of a state-of-the-art VGG-13 that is evaluated on fully-visible faces. Given that our model sees only the lower-half of the face, we consider this to be a remarkable achievement. In conclusion, we consider that our distilled CNN models can provide useful feedback for the task of recognizing the facial expressions of a person wearing a VR headset.
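The triplet-based distillation objective described above can be sketched as follows. This is a minimal illustration, not the authors' code: the Euclidean distance, the margin value, and the function names are assumptions. The abstract states only that the anchor-positive distance should become smaller than the anchor-negative distance and that the two distilled embeddings are combined into a single vector.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: zero once d(anchor, positive) + margin <= d(anchor, negative)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def combine(embedding_a, embedding_b):
    """Concatenate two distilled embeddings into a single vector."""
    return list(embedding_a) + list(embedding_b)

anchor = [0.0, 0.0]    # student embedding of an occluded face
positive = [0.0, 1.0]  # teacher embedding, same class as the anchor
negative = [3.0, 4.0]  # student embedding, different class

loss_satisfied = triplet_loss(anchor, positive, negative)  # 1 - 5 + 1 < 0, clipped to 0
loss_violated = triplet_loss(anchor, negative, positive)   # 5 - 1 + 1 = 5
combined = combine(anchor, positive)
```

The loss is zero when the positive is already closer than the negative by at least the margin, so gradient updates only come from triplets that still violate the ordering.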
- Video Transcripts
- 10.48448/cmzx-ez05
- Dec 29, 2020
- Underline Science Inc.
In this paper, we study the task of facial expression recognition under strong occlusion. We are particularly interested in cases where 50% of the face is occluded, e.g. when the subject wears a Virtual Reality (VR) headset. While previous studies show that pre-training convolutional neural networks (CNNs) on fully-visible (non-occluded) faces improves the accuracy, we propose to employ knowledge distillation to achieve further improvements. First of all, we employ the classic teacher-student training strategy, in which the teacher is a CNN trained on fully-visible faces and the student is a CNN trained on occluded faces. Second of all, we propose a new approach for knowledge distillation based on triplet loss. During training, the goal is to reduce the distance between an anchor embedding, produced by a student CNN that takes occluded faces as input, and a positive embedding (from the same class as the anchor), produced by a teacher CNN trained on fully-visible faces, so that it becomes smaller than the distance between the anchor and a negative embedding (from a different class than the anchor), produced by the student CNN. Third of all, we propose to combine the distilled embeddings obtained through the classic teacher-student strategy and our novel teacher-student strategy based on triplet loss into a single embedding vector. We conduct experiments on two benchmarks, FER+ and AffectNet, with two CNN architectures, VGG-f and VGG-face, showing that knowledge distillation can bring significant improvements over the state-of-the-art methods designed for occluded faces in the VR setting. Furthermore, we obtain accuracy rates that are quite close to the state-of-the-art models that take as input fully-visible faces.
For example, on the FER+ data set, our VGG-face based on concatenated distilled embeddings attains an accuracy rate of 82.75% on lower-half-visible faces, which is only 2.24% below the accuracy rate of a state-of-the-art VGG-13 that is evaluated on fully-visible faces. Given that our model sees only the lower-half of the face, we consider this to be a remarkable achievement. In conclusion, we consider that our distilled CNN models can provide useful feedback for the task of recognizing the facial expressions of a person wearing a VR headset.
- Research Article
13
- 10.1109/tcds.2021.3052526
- Jun 1, 2022
- IEEE Transactions on Cognitive and Developmental Systems
Current top-performing saliency prediction methods of omnidirectional images (ODIs) depend on deep feedforward convolutional neural networks (CNNs), benefiting from their powerful multiscale representation ability. Although these methods adopt deep feedforward CNNs to achieve superb performance in saliency prediction task, they have the following limitations: 1) these deep feedforward CNNs are difficult to map to ventral stream structure of the brain visual system due to their vast number of layers and missing biologically important connections, such as recurrence and 2) most deep feedforward CNNs represent the multiscale features in a layerwise manner. To tackle these issues, models that could learn multiscale features yet share the similarities with human brain are needed. In this article, we propose a novel multiscale brain-like network (MBN) model to predict saliency of head fixations on ODIs. Specifically, our proposed model consists of two major modules: 1) a brain-like CORnet-S module and 2) a multiscale feature extraction module. The CORnet-S module is a lightweight backbone network with four anatomically mapped areas (V1, V2, V4, and IT) and it can simulate the visual processing mechanism of ventral visual stream in the human brain. The multiscale feature extraction module is inspired by the multiscale brain structure, which represents multiscale features at a granular level and increases the range of receptive fields for each network layer. Extensive experiments and ablation studies conducted on two major benchmarks demonstrate the superiority of the proposed MBN model over the state-of-the-art methods.
- Research Article
31
- 10.1609/aaai.v36i1.19937
- Jun 28, 2022
- Proceedings of the AAAI Conference on Artificial Intelligence
Omnidirectional images, also called 360° images, have attracted extensive attention in recent years, due to the rapid development of virtual reality (VR) technologies. During omnidirectional image processing, including capture, transmission, and consumption, measuring the perceptual quality of omnidirectional images is highly desired, since it plays a great role in guaranteeing the immersive quality of experience (IQoE). In this paper, we conduct a comprehensive study on the perceptual quality of omnidirectional images from both subjective and objective perspectives. Specifically, we construct the largest subjective omnidirectional image quality database to date, where we consider several key influential elements, i.e., realistic non-uniform distortion, viewing condition, and viewing behavior, from the user's view. In addition to subjective quality scores, we also record head and eye movement data. Moreover, we make the first attempt to use the proposed database to train a convolutional neural network (CNN) for blind omnidirectional image quality assessment. To be consistent with human viewing behavior in the VR device, we extract viewports from each omnidirectional image and incorporate the user viewing conditions naturally in the proposed model. The proposed model is composed of two parts: a multi-scale CNN-based feature extraction module and a perceptual quality prediction module. The feature extraction module is used to incorporate the multi-scale features, and the perceptual quality prediction module is designed to regress them to perceived quality scores. The experimental results on our database verify that the proposed model achieves competitive performance compared with the state-of-the-art methods.
- Research Article
170
- 10.1016/j.neucom.2021.10.036
- Oct 14, 2021
- Neurocomputing
We propose a new type of neural network, Kronecker neural networks (KNNs), that form a general framework for neural networks with adaptive activation functions. KNNs employ the Kronecker product, which provides an efficient way of constructing a very wide network while keeping the number of parameters low. Our theoretical analysis reveals that, under suitable conditions, KNNs induce a faster decay of the loss than feed-forward networks do. This is also empirically verified through a set of computational examples. Furthermore, under certain technical assumptions, we establish global convergence of gradient descent for KNNs. As a specific case, we propose the Rowdy activation function, which is designed to get rid of any saturation region by injecting sinusoidal fluctuations that include trainable parameters. The proposed Rowdy activation function can be employed in any neural network architecture, such as feed-forward, recurrent, and convolutional neural networks. The effectiveness of KNNs with Rowdy activation is demonstrated through various computational experiments, including function approximation using feed-forward neural networks, solution inference of partial differential equations using physics-informed neural networks, and standard deep learning benchmark problems using convolutional and fully-connected neural networks.
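The idea of removing saturation by adding sinusoidal terms can be sketched in a few lines. The exact form below is an assumption made for illustration: a tanh base plus sine terms whose amplitudes stand in for the trainable parameters. The abstract does not give the Rowdy formula, so the name `rowdy` and its signature are hypothetical.

```python
import math

def rowdy(x, amplitudes, freq_scale=1.0):
    """Assumed Rowdy-style activation: tanh(x) + sum_k a_k * sin(k * freq_scale * x).

    The amplitudes a_k play the role of trainable parameters; with none given,
    the function reduces to the plain tanh base activation.
    """
    out = math.tanh(x)
    for k, a in enumerate(amplitudes, start=1):
        out += a * math.sin(k * freq_scale * x)
    return out

plain = rowdy(0.3, [])                 # identical to tanh(0.3)
wiggly = rowdy(math.pi / 2, [0.5])     # tanh(pi/2) plus a 0.5 * sin(pi/2) fluctuation
```

For large inputs tanh flattens out and its gradient vanishes, while the added sine terms keep oscillating, so the sum no longer has a flat region.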
- Research Article
23
- 10.1155/2018/9327536
- Jan 1, 2018
- Complexity
Making every component of an electrical system work in unison is being made more challenging by the increasing number of renewable energies used, the electrical output of which is difficult to determine beforehand. In Spain, the daily electricity market opens with a 12‐hour lead time, where the supply and demand expected for the following 24 hours are presented. When estimating the generation, energy sources like nuclear are highly stable, while peaking power plants can be run as necessary. Renewable energies, however, which should eventually replace peakers insofar as possible, are reliant on meteorological conditions. In this paper we propose using different deep‐learning techniques and architectures to solve the problem of predicting wind generation in order to participate in the daily market, by making predictions 12 and 36 hours in advance. We develop and compare various estimators based on feedforward, convolutional, and recurrent neural networks. These estimators were trained and validated with data from a wind farm located on the island of Tenerife. We show that the best candidates for each type are more precise than the reference estimator and the polynomial regression currently used at the wind farm. We also conduct a sensitivity analysis to determine which estimator type is most robust to perturbations. An analysis of our findings shows that the most accurate and robust estimators are those based on feedforward neural networks with a SELU activation function and convolutional neural networks.
- Conference Article
1
- 10.1136/archdischild-2019-rcpch.434
- May 1, 2019
Aims: To assess the use of a virtual reality (VR) headset as a distraction technique in children undergoing short painful procedures (cannulation, venepuncture, wound closure or foreign body removal) in the paediatric Emergency Department. Methods: We compared how distracted children were with VR (Pico Goblin headset, using the 'Happy Place' animated interactive 360 degree experience) with an equivalent group of children receiving traditional distraction (TD) methods (a play specialist and the child's choice of book, game or tablet computer). Children aged 5 and above were recruited. We excluded children with head, neck or facial injury, history of epilepsy, and nausea or vomiting on presentation. Twenty patients were recruited to each group. Staff rated how distracted the child had been during the procedure using the Children's Emotional Manifestation Scale (CEMS) and rated their pain behaviours using the Face Legs Arms Cry Consolability (FLACC) Scale. Parents also completed a questionnaire on their experience. Results: Patients using VR were more distracted compared with those receiving TD (average CEMS score=5 with VR, 6 with TD) but this was not statistically significant (p=0.74). Patients using VR showed fewer reactive pain behaviours than the TD group (FLACC score 0 vs 1.5; p=0.004). Written feedback from parents regarding VR was positive, and staff were enthusiastic about the success of the new technology for distraction. Conclusion: Thirty percent of VR headset use in our department occurred out of hours. A VR headset may provide emergency department staff with a convenient way to offer procedural distraction to children out of hours if there is no play specialist available. We would strongly advocate that a VR headset is not a substitute for the skills and experience of a play specialist who can provide individualised distraction to children, regardless of age, presenting complaint or past medical history.
A VR headset is a novel tool that can add to the variety of distraction methods available to play specialists in the Paediatric Emergency Department, but its use should be judged on a case by case basis.
- Research Article
7
- 10.1111/iej.14252
- May 12, 2025
- International Endodontic Journal
The use of haptic virtual reality simulators in preclinical dental education is evolving rapidly. However, the application of immersive haptic simulations for specific dental procedures, such as access cavity preparation, has not been extensively explored. This study aimed to (i) evaluate the impact of using the VirTeaSy Dental® simulator in conjunction with a virtual reality (VR) headset on student performance during access cavity preparation, with a focus on haptic parameters; (ii) assess students' perceptions of the experience; and (iii) examine the side effects associated with VR headset use. The study included 90 third-year dental students from the Dental Faculty of Nantes University, enrolled in January 2023. Participants were divided into two parallel groups. In Phase 1, Group 1 (n = 45) completed two endodontic access cavity exercises on the VirTeaSy Dental® without the VR headset, whilst Group 2 performed the same exercises using the VR headset. In Phase 2, the groups switched conditions and followed the same protocol. Performance was assessed using haptic parameters, and comparisons between groups for each phase were made using parametric and non-parametric tests (p < .05). Students also completed questionnaires to assess their experience and report any side effects from using the VR headset. Across both groups and phases, participants performed better in access cavity preparation without the VR headset. They showed greater accuracy, made fewer errors, and completed the exercises more quickly. Notably, more students failed to complete the exercises within the 10-minute time limit when using the VR headset (27 vs. 12 in Group 1, 23 vs. 13 in Group 2). Most participants expressed a preference for using VirTeaSy Dental® without the VR headset. Approximately 20% of students reported side effects, including dizziness, nausea, migraines, and neck muscle fatigue.
The results suggest that full immersion in haptic simulation, when paired with a VR headset, negatively impacts student performance in complex tasks such as access cavity preparation. These findings underscore the current limitations of immersive virtual reality in dental education and highlight the need for technical refinements before its widespread adoption in preclinical training.
- Research Article
2
- 10.3390/metabo15030174
- Mar 3, 2025
- Metabolites
Background/Objectives: Metabolomics has recently emerged as a key tool in the biological sciences, offering insights into metabolic pathways and processes. Over the last decade, network-based machine learning approaches have gained significant popularity and application across various fields. While several studies have utilized metabolomics profiles for sample classification, many network-based machine learning approaches remain unexplored for metabolomic-based classification tasks. This study aims to compare the performance of various network-based machine learning approaches, including recently developed methods, in metabolomics-based classification. Methods: A standard data preprocessing procedure was applied to 17 metabolomic datasets, and Bayesian neural network (BNN), convolutional neural network (CNN), feedforward neural network (FNN), Kolmogorov-Arnold network (KAN), and spiking neural network (SNN) were evaluated on each dataset. The datasets varied widely in size, mass spectrometry method, and response variable. Results: With respect to AUC on test data, BNN, CNN, FNN, KAN, and SNN were the top-performing models in 4, 1, 5, 3, and 4 of the 17 datasets, respectively. Regarding F1-score, the top-performing models were BNN (3 datasets), CNN (3 datasets), FNN (4 datasets), KAN (4 datasets), and SNN (3 datasets). For accuracy, BNN, CNN, FNN, KAN, and SNN performed best in 4, 1, 4, 4, and 4 datasets, respectively. Conclusions: No network-based modeling approach consistently outperformed others across the metrics of AUC, F1-score, or accuracy. Our results indicate that while no single network-based modeling approach is superior for metabolomics-based classification tasks, BNN, KAN, and SNN may be underappreciated and underutilized relative to the more commonly used CNN and FNN.
- Conference Article
1
- 10.1109/icpr48806.2021.9412001
- Jan 10, 2021
Deep feedforward convolutional neural networks (CNNs) perform well in the saliency prediction of omnidirectional images (ODIs), and have become the leading class of candidate models of the visual processing mechanism in the primate ventral stream. These CNNs have evolved from shallow network architectures to extremely deep and branching architectures to achieve superb performance in various vision tasks, yet it is unclear how brain-like they are. In particular, these deep feedforward CNNs are difficult to map to the ventral stream structure of the brain visual system due to their vast number of layers and missing biologically-important connections, such as recurrence. To tackle this issue, some brain-like shallow neural networks have been introduced. In this paper, we propose a novel brain-like network model for saliency prediction of head fixations on ODIs. Specifically, our proposed model consists of three modules: a CORnet-S module, a template feature extraction module and a ranking attention module (RAM). The CORnet-S module is a lightweight artificial neural network (ANN) with four anatomically mapped areas (V1, V2, V4 and IT) and it can simulate the visual processing mechanism of the ventral visual stream in the human brain. The template feature extraction module is introduced to extract attention maps of ODIs and provide guidance for the feature ranking in the following RAM module. The RAM module is used to rank and select features that are important for fine-grained saliency prediction. Extensive experiments have validated the effectiveness of the proposed model in predicting saliency maps of ODIs, and the proposed model outperforms other state-of-the-art methods of similar scale.
- Video Transcripts
- 10.48448/kxd4-zh15
- Dec 29, 2020
- Underline Science Inc.
Deep feedforward convolutional neural networks (CNNs) perform well in the saliency prediction of omnidirectional images (ODIs), and have become the leading class of candidate models of the visual processing mechanism in the primate ventral stream. These CNNs have evolved from shallow network architectures to extremely deep and branching architectures to achieve superb performance in various vision tasks, yet it is unclear how brain-like they are. In particular, these deep feedforward CNNs are difficult to map to the ventral stream structure of the brain visual system due to their vast number of layers and missing biologically-important connections, such as recurrence. To tackle this issue, some brain-like shallow neural networks have been introduced. In this paper, we propose a novel brain-like network model for saliency prediction of head fixations on ODIs. Specifically, our proposed model consists of three modules: a CORnet-S module, a template feature extraction module and a ranking attention module (RAM). The CORnet-S module is a lightweight artificial neural network (ANN) with four anatomically mapped areas (V1, V2, V4 and IT) and it can simulate the visual processing mechanism of the ventral visual stream in the human brain. The template feature extraction module is introduced to extract attention maps of ODIs and provide guidance for the feature ranking in the following RAM module. The RAM module is used to rank and select features that are important for fine-grained saliency prediction. Extensive experiments have validated the effectiveness of the proposed model in predicting saliency maps of ODIs.
- Research Article
16
- 10.1109/tcsvt.2022.3229701
- Jun 1, 2023
- IEEE Transactions on Circuits and Systems for Video Technology
Progressive coding is essential to the practical deployment of learned image compression over heterogeneous networks and clients. Existing methods for learned progressive image compression require complex and empirical design to achieve near-optimal rate-distortion performance over a wide range of bit-rates. However, these methods are limited by the implicit learned mechanism based on neural networks and introduction of uniform quantizers. In this paper, we propose generalized learned progressive image compression with analytic rate-distortion optimization using dead-zone quantizers on the latent representation. Specifically, we reveal that dead-zone quantizers, as a general case of uniform quantizers, are equivalent to uniform quantizers in fixed-rate nonlinear transform coding and can prevent extra redundancy in embedded quantization for progressive coding. Consequently, we propose rate-distortion optimized learned progressive coding by approximating the optimal quantizer in the source spaces using dead-zone quantizers in an analytic manner on the Laplacian source. To our best knowledge, this paper is the first to achieve general learned progressive coding from the perspective of optimal quantizers. The proposed method achieves theoretically sound and practically efficient embedded quantization and learned progressive coding of latent representations with improved rate-distortion performance. It can also enable embedded quantization with diverse assignments of truncation points and support flexible configuration of quality layers of varying numbers and at varying target bit-rates. Furthermore, we successfully incorporate the proposed method into existing pre-trained fixed-rate models to realize progressive learned image compression without re-training. Experimental results demonstrate that the proposed method achieves state-of-the-art rate-distortion performance in learned progressive image compression compared with traditional codecs and recent learned methods.
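The relationship between uniform and dead-zone quantizers mentioned above can be made concrete with a textbook scalar quantizer. This is a generic sketch, not the paper's rate-distortion optimized design: the rounding-offset parameterization and the function names are assumptions chosen for illustration.

```python
import math

def deadzone_quantize(x, step, offset=0.5):
    """Map x to an integer index using sign(x) * floor(|x| / step + offset).

    offset = 0.5 is ordinary uniform rounding; a smaller offset widens the
    zero bin (the dead zone), so more small values quantize to 0.
    """
    sign = -1 if x < 0 else 1
    return sign * math.floor(abs(x) / step + offset)

def dequantize(index, step):
    """Reconstruct at the bin's nominal value (no reconstruction offset)."""
    return index * step

uniform_index = deadzone_quantize(0.6, 1.0, offset=0.5)    # floor(1.1)  -> 1
deadzone_index = deadzone_quantize(0.6, 1.0, offset=0.25)  # floor(0.85) -> 0
```

With `offset=0.5` the function behaves as a uniform quantizer, which is the sense in which the dead-zone family contains the uniform quantizer as a special case.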
- Research Article
23
- 10.1109/tip.2022.3202357
- Jan 1, 2022
- IEEE Transactions on Image Processing
State-of-the-art 2D image compression schemes rely on the power of convolutional neural networks (CNNs). Although CNNs offer promising perspectives for 2D image compression, extending such models to omnidirectional images is not straightforward. First, omnidirectional images have specific spatial and statistical properties that cannot be fully captured by current CNN models. Second, basic mathematical operations composing a CNN architecture, e.g., translation and sampling, are not well-defined on the sphere. In this paper, we study the learning of representation models for omnidirectional images and propose to use the properties of HEALPix uniform sampling of the sphere to redefine the mathematical tools used in deep learning models for omnidirectional images. In particular, we: i) propose the definition of a new convolution operation on the sphere that keeps the high expressiveness and the low complexity of a classical 2D convolution; ii) adapt standard CNN techniques such as stride, iterative aggregation, and pixel shuffling to the spherical domain; and then iii) apply our new framework to the task of omnidirectional image compression. Our experiments show that our proposed on-the-sphere solution leads to a better compression gain that can save 13.7% of the bit rate compared to similar learned models applied to equirectangular images. Also, compared to learning models based on graph convolutional networks, our solution supports more expressive filters that can preserve high frequencies and provide a better perceptual quality of the compressed images. Such results demonstrate the efficiency of the proposed framework, which opens new research avenues for other omnidirectional vision tasks to be effectively implemented on the sphere manifold.
- Conference Article
10
- 10.5151/proceedings-ecaadesigradi2019_339
- Dec 1, 2019
In this study, we developed a method for generating omnidirectional depth images from corresponding omnidirectional RGB images of streetscapes by learning each pair of omnidirectional RGB and depth images created by computer graphics using pix2pix. Then, the models trained with different series of images shot under different site and weather conditions were applied to Google street view images to generate depth images. The validity of the generated depth images was then evaluated visually. In addition, we conducted experiments to evaluate Google street view images using multiple participants. We constructed a model that estimates the evaluation value of these images with and without the depth images using the learning-to-rank method with deep convolutional neural network. The results demonstrate the extent to which the generalization performance of the streetscape evaluation model changes depending on the presence or absence of depth images.