ODVISTA: An Omnidirectional Video Dataset for Super-Resolution and Quality Enhancement Tasks

Abstract

Omnidirectional or 360-degree video is being increasingly deployed, largely due to the latest advancements in immersive virtual reality (VR) and extended reality (XR) technology. However, the adoption of these videos in streaming encounters challenges related to bandwidth and latency, particularly in mobility conditions such as with unmanned aerial vehicles (UAVs). Adaptive resolution and compression aim to preserve quality while maintaining low latency under these constraints, yet downscaling and encoding can still degrade quality and introduce artifacts. Machine learning (ML)-based super-resolution (SR) and quality enhancement techniques offer a promising solution by enhancing detail recovery and reducing compression artifacts. However, currently available public 360-degree video SR datasets lack compression artifacts, which limits research in this field. To bridge this gap, this paper introduces the omnidirectional video streaming dataset (ODVista), which comprises 200 high-resolution, high-quality videos downscaled and encoded at four bitrate ranges using the High-Efficiency Video Coding (HEVC)/H.265 standard. Evaluations show that the dataset not only features a wide variety of scenes but also spans different levels of content complexity, which is crucial for robust solutions that perform well in real-world scenarios and generalize across diverse visual environments. Additionally, we evaluate the performance, considering both quality enhancement and runtime, of two handcrafted and two ML-based SR models on the validation and testing sets of ODVista.
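
As a rough illustration of the kind of preparation pipeline the abstract describes, the sketch below downscales a source video and encodes it with HEVC/H.265 at four target bitrates via ffmpeg. The file names, scale factor, and bitrate ladder are assumptions for illustration, not ODVista's actual settings.

```python
# Hypothetical sketch of ODVista-style preprocessing: downscale a source
# video, then encode it with HEVC/H.265 (libx265) at several target
# bitrates. All names and numbers below are illustrative assumptions.
import subprocess

SOURCE = "source_8k.mp4"              # assumed high-resolution input
BITRATES = ["1M", "3M", "6M", "12M"]  # assumed four-rung bitrate ladder

for rate in BITRATES:
    subprocess.run([
        "ffmpeg", "-y", "-i", SOURCE,
        "-vf", "scale=iw/2:ih/2",     # 2x spatial downscale before encoding
        "-c:v", "libx265",            # HEVC/H.265 encoder
        "-b:v", rate,                 # target bitrate for this rung
        f"odvista_{rate}.mp4",
    ], check=True)
```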

Similar Papers
  • Research Article
  • Cited by 1
  • 10.1177/16094069251397351
360-Degree Video for Whole Scene Capture: From Immersive Realism to Immersive Holism in Place-Based Research
  • Nov 1, 2025
  • International Journal of Qualitative Methods
  • Jonathan Cinnamon + 3 more

360-degree video is an affordable and easy-to-use technology for social science research. It holds significant potential for capturing spatio-temporal aspects of the social world from a fully omni-directional spatial perspective; however, gaps remain as to how it can be used to support field-based data collection and analysis. In this short piece we offer two contributions to the literature on 360-degree video for qualitative social science research on place. First, we draw on evidence from our multi-city study of ‘urban platform temporalities’ to develop a step-by-step procedure for producing and analyzing 360-degree digital video datasets, demonstrating the potential of the technology for what we term whole scene capture. We provide practical advice on software, hardware, camera usage, video processing, and ethical considerations; and introduce the 360-video qualitative coding technique of spherical simultaneous perspective. Adding new evidence of its use to already established literatures on 360-degree immersive video ethnographies and virtual human-environment exposure research, our method for systematic 360-degree capture of spatio-temporal data is applicable to a range of social science studies with a field-based data collection component. Finally, drawing together technological understandings of immersion from the field of VR with its ethnographic meaning, we then articulate the notion of immersive holism as a quality of 360-degree video that enables deep, meaningful, and comprehensive knowledge of place.

  • Conference Article
  • Cited by 35
  • 10.1109/icassp.2019.8683318
Towards Generating Ambisonics Using Audio-visual Cue for Virtual Reality
  • May 1, 2019
  • Aakanksha Rana + 2 more

Ambisonics, i.e., full-sphere surround sound, is quintessential with 360-degree visual content to provide a realistic virtual reality (VR) experience. While 360-degree visual content capture has gained a tremendous boost recently, the estimation of corresponding spatial sound is still challenging due to the required sound-field microphones or information about the sound-source locations. In this paper, we introduce the novel problem of generating Ambisonics for 360-degree videos using audio-visual cues. With this aim, firstly, a novel 360-degree audio-visual video dataset of 265 videos is introduced with annotated sound-source locations. Secondly, a pipeline is designed for the automatic Ambisonics-estimation problem. Benefiting from deep learning-based audio-visual feature-embedding and prediction modules, our pipeline estimates the 3D sound-source locations and further uses these locations to encode the B-format. To benchmark our dataset and pipeline, we additionally propose evaluation criteria to investigate performance using different 360-degree input representations. Our results demonstrate the efficacy of the proposed pipeline and open up a new area of research in 360-degree audio-visual analysis for future investigations.
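
For readers unfamiliar with the B-format mentioned above, the sketch below shows the standard first-order Ambisonic (FuMa convention) encoding of a mono source at a given direction, the kind of final encoding step the pipeline's estimated sound-source locations would feed into. The signal and angles are invented for illustration.

```python
# Minimal sketch of first-order Ambisonic (B-format, FuMa W/X/Y/Z)
# encoding of a mono source at an estimated direction. Not the paper's
# pipeline; just the textbook encoding equations.
import numpy as np

def encode_b_format(mono, azimuth, elevation):
    """Encode a mono signal into W/X/Y/Z channels (angles in radians)."""
    w = mono * (1.0 / np.sqrt(2.0))                 # omnidirectional
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left-right
    z = mono * np.sin(elevation)                    # up-down
    return np.stack([w, x, y, z])

signal = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)  # 1 s, 440 Hz
bformat = encode_b_format(signal, azimuth=np.pi / 4, elevation=0.0)
```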

  • Research Article
  • Cited by 5
  • 10.1145/3551641
Quality Enhancement of Compressed 360-Degree Videos Using Viewport-based Deep Neural Networks
  • Feb 6, 2023
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Qipu Qin + 1 more

360-degree video provides omnidirectional views on a bounding sphere and is thus also called omnidirectional video. For omnidirectional video, people can only see the specific content in the viewport through head movement, i.e., only a small portion of the 360-degree content is exposed at a given time. Therefore, viewport quality is of particular importance for 360-degree videos. In this article, we propose a quality enhancement method for compressed 360-degree videos using viewport-based deep neural networks, named V-DNN. V-DNN is mainly composed of two modules: a viewport prediction network (VPN) and a viewport quality enhancement network (VQEN). The VPN, based on spherical convolution and 2D convolution, generates potential viewports for the omnidirectional video. The VQEN takes the current viewport and its reference viewports as input and enhances the residual for the current viewport based on bidirectional offset prediction and spatio-temporal deformable convolutions. Experimental results show that, compared with the HM16.16 baseline at QP = 37 under the Low Delay P (LDP) configuration, V-DNN achieves average gains of 0.605 dB and 0.0139 in viewport-based ΔPSNR and ΔMS-SSIM, respectively, and is 0.379 dB (59.63%) and 0.0073 (110.61%) higher than the multi-frame quality enhancement (MFQE-2.0) scheme at QP = 37, respectively. Moreover, V-DNN consistently outperforms MFQE-1.0, MFQE-2.0, and the HM16.16 baseline at the other QPs in terms of ΔPSNR, ΔWS-PSNR, and ΔMS-SSIM.
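
The ΔWS-PSNR figures above rely on WS-PSNR, a standard sphere-aware metric: plain PSNR with per-row cos(latitude) weights that discount the oversampled pole regions of an equirectangular frame. A minimal implementation sketch follows; it is the generic metric, not part of V-DNN.

```python
# Sketch of WS-PSNR for an equirectangular frame: a weighted PSNR where
# each row is weighted by the cosine of its latitude, compensating for
# the oversampling near the poles in the equirectangular projection.
import numpy as np

def ws_psnr(ref, dist, peak=255.0):
    """ref, dist: (H, W) float arrays of an equirectangular frame."""
    h, w = ref.shape
    lat = (np.arange(h) + 0.5 - h / 2) * np.pi / h    # per-row latitude
    weights = np.cos(lat)[:, None] * np.ones((1, w))  # cos-latitude map
    wmse = np.sum(weights * (ref - dist) ** 2) / np.sum(weights)
    return 10 * np.log10(peak ** 2 / wmse)
```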

  • Research Article
  • Cited by 30
  • 10.1109/tmm.2023.3267294
Omnidirectional Video Super-Resolution Using Deep Learning
  • Jan 1, 2024
  • IEEE Transactions on Multimedia
  • Arbind Agrahari Baniya + 3 more

Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution of 360° videos does not allow each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited number of 360° video datasets available for study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) together with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360°-specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training, and optimisation.
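
As a loose illustration of what a distortion-aware objective can look like (not S3PO's actual loss function), the sketch below weights an L1 reconstruction error by cos(latitude), so that the oversampled pixels near the equirectangular poles contribute proportionally less to training.

```python
# Sketch of a latitude-weighted L1 reconstruction loss in the spirit of
# a spherical-distortion-aware objective. Illustrative only; S3PO's loss
# is defined differently in the paper.
import numpy as np

def spherical_l1(pred, target):
    """pred, target: (H, W) equirectangular frames."""
    h, w = pred.shape
    lat = (np.arange(h) + 0.5 - h / 2) * np.pi / h
    weights = np.cos(lat)[:, None]          # per-row cos-latitude weights
    return np.sum(weights * np.abs(pred - target)) / (np.sum(weights) * w)
```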

  • Conference Article
  • Cited by 4
  • 10.1117/12.2586403
Real-time object detection in 360-degree videos
  • Apr 12, 2021
  • Jounsup Park

Streaming of 360-degree videos over the internet is a challenging task, but it provides rich multimedia experiences by allowing viewers to navigate 360-degree content. 360-degree videos need larger bandwidth and lower latency to be streamed over the internet than conventional videos. Therefore, the non-visible area must be discarded from the video to save bandwidth. View-prediction techniques have been used to predict the visible area of the 360-degree video frames to be streamed. Linear regression using a viewer's past viewing-behavior data is useful for predicting the viewer's short-term future behavior, but not when the network delay is longer than the prediction horizon. Object-detection techniques help predict viewers' future motion over longer prediction horizons, since viewers tend to follow the objects that draw their attention. However, conventional object-detection techniques using a convolutional neural network, such as YOLO, are difficult to apply to 360-degree videos. Distortions arise when the spherical 360-degree video is projected into equirectangular videos for processing and storage: the same object can have different shapes in the equirectangular video depending on its angular position on the sphere. Therefore, in this paper, we propose a multi-directional projection (MDP) technique to detect objects in 360-degree videos. The proposed technique mitigates the distortions in the equirectangular videos and feeds the redirected videos to the object-detection system, so a neural network trained with a conventional video dataset can be used without any change. Experimental results show that the proposed method helps detect objects at the edges of 360-degree videos.
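
To make the projection idea concrete, the sketch below renders one undistorted perspective viewport from an equirectangular frame via the inverse gnomonic mapping; rendering several such views at different (lon, lat) centers is the general spirit of multi-directional projection, though the paper's exact construction may differ. The viewport size and field of view are assumptions.

```python
# Illustrative "redirected view": inverse gnomonic mapping that samples
# an undistorted perspective viewport from an equirectangular frame,
# which a stock detector such as YOLO can then consume unchanged.
import numpy as np

def viewport(equi, lon0, lat0, fov=np.pi / 2, size=416):
    """Sample a size x size perspective view centered at (lon0, lat0)."""
    H, W = equi.shape[:2]
    half = np.tan(fov / 2)
    u, v = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(-half, half, size))
    rho = np.sqrt(u ** 2 + v ** 2)
    c = np.arctan(rho)                            # angular distance
    rho = np.where(rho == 0, 1e-9, rho)           # avoid divide-by-zero
    lat = np.arcsin(np.cos(c) * np.sin(lat0)
                    + v * np.sin(c) * np.cos(lat0) / rho)
    lon = lon0 + np.arctan2(u * np.sin(c),
                            rho * np.cos(lat0) * np.cos(c)
                            - v * np.sin(c) * np.sin(lat0))
    rows = np.clip(((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    cols = ((lon / (2 * np.pi) + 0.5) % 1.0 * W).astype(int)
    return equi[rows, cols]                       # nearest-neighbor sample
```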

  • Research Article
  • Cited by 2
  • 10.1109/access.2022.3204331
Viewport History as a Heuristic for Quality Enhancement and Quality Variation Control in Viewport-Aware Tile-Based 360-Degree Video Streaming
  • Jan 1, 2022
  • IEEE Access
  • Kiana Dziubinski + 1 more

Despite the growing popularity of Virtual Reality (VR), 360-degree videos are often regarded as challenging to stream due to their large bandwidth requirement. As a solution, the 360-degree video content is spatially divided into tiles, and the quality level for each tile is selected based on the user’s network environment and viewport information. To determine the high-quality tiles, viewport prediction and viewport history methods are used to estimate the user’s viewport. However, due to the unpredictability of user head movements, generating accurate viewport estimates is difficult, which can severely degrade the Quality of Experience (QoE) for the user. In this paper, to sustain high user QoE, we detail a novel tile quality selection algorithm that employs viewport prediction, viewport history, viewport extensions, and a viewport tile count limit. In addition, we include a comparison analysis on six 360-degree videos that vary in content pace. Based on simulations, using viewport history as a heuristic for tile quality selection demonstrated a significant increase in perceived quality while suppressing quality variation inside the viewport and across segments compared to eight reference methods; secondly, 360-degree videos slow in content pace tended to result in lower viewport prediction accuracy, lower QoE performance, and weaker viewport history trends compared to 360-degree videos fast in content pace.
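
A toy version of such a tile-quality selection heuristic is sketched below, purely as an assumption-laden illustration: predicted-viewport tiles get high quality, the most-visited tiles from viewport history get medium quality up to a tile-count limit, and everything else stays at base quality. None of the names or numbers come from the paper.

```python
# Hypothetical tile-quality selection combining viewport prediction with
# a viewport-history heuristic and a tile-count limit. Illustrative
# sketch only, not the paper's algorithm.
def select_tile_qualities(predicted, history_counts, n_tiles=24, limit=8):
    """predicted: set of tile ids; history_counts: {tile id: visits}."""
    quality = {t: "low" for t in range(n_tiles)}
    for t in predicted:
        quality[t] = "high"                    # predicted viewport tiles
    ranked = sorted(history_counts, key=history_counts.get, reverse=True)
    budget = max(limit - len(predicted), 0)    # remaining upgrade budget
    upgrades = [t for t in ranked if t in quality and quality[t] == "low"]
    for t in upgrades[:budget]:
        quality[t] = "medium"                  # history-favored tiles
    return quality

print(select_tile_qualities({5, 6, 11}, {4: 9, 12: 7, 6: 5, 0: 1}))
```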

  • Book Chapter
  • Cited by 14
  • 10.1007/978-981-32-9291-8_33
Activity Recognition for Indoor Fall Detection in 360-Degree Videos Using Deep Learning Techniques
  • Sep 20, 2019
  • Dhiraj + 8 more

Human activity recognition (HAR) targets the methodologies to recognize different actions from a sequence of observations. Vision-based activity recognition is among the most popular unobtrusive techniques for activity recognition. Caring for the elderly who live alone, from a remote location, is one of the biggest challenges of modern human society and is an area of active research. The growing number of cameras in smart homes and our daily environment provides a platform to use this technology for activity recognition as well. Omnidirectional cameras can be utilized for fall detection, minimizing the number of cameras required in an indoor living scenario. Consequently, two vision-based solutions have been proposed: one using convolutional neural networks in 3D mode and another using a hybrid approach combining convolutional neural networks and long short-term memory networks, both operating on 360-degree videos for human fall detection. An omnidirectional video dataset has been generated by recording a set of activities performed by different people, as no such 360-degree video dataset is available in the public domain for human activity recognition. Both models provide fall-detection accuracy of more than 90% for omnidirectional videos and can be used for developing a fall-detection system for indoor health care.
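
A minimal sketch of the hybrid CNN+LSTM idea (assumptions throughout; not the authors' architecture) might look as follows in Keras: per-frame convolutional features are extracted with TimeDistributed layers and an LSTM aggregates them over time into a fall / no-fall prediction.

```python
# Illustrative CNN+LSTM hybrid for clip-level fall detection. Frame
# count, resolution, and layer sizes are invented; the paper's models
# are not specified at this level in the abstract.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"),
                           input_shape=(16, 128, 256, 3)),  # 16 frames
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(64),                        # temporal modelling
    layers.Dense(1, activation="sigmoid"),  # fall probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```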

  • Research Article
  • Cited by 9
  • 10.1109/tvcg.2023.3247462
Introducing 3D Thumbnails to Access 360-Degree Videos in Virtual Reality.
  • May 1, 2023
  • IEEE Transactions on Visualization and Computer Graphics
  • Alissa Vermast + 1 more

360° videos provide an immersive experience, especially when watched in virtual reality (VR). Yet, even though the video data is inherently three-dimensional, interfaces to access datasets of such videos in VR almost always use two-dimensional thumbnails shown in a grid on a flat or curved plane. We claim that using spherical and cube-shaped 3D thumbnails may provide a better user experience and be more effective at conveying the high-level subject matter of a video or when searching for a specific item in it. A comparative study against the most used existing representation, that is, 2D equirectangular projections, showed that the spherical 3D thumbnails did indeed provide the best user experience, whereas traditional 2D equirectangular projections still performed better for high-level classification tasks. Yet, they were outperformed by spherical thumbnails when participants had to search for details within the videos. Our results thus confirm a potential benefit of 3D thumbnail representations for 360-degree videos in VR, especially with respect to user experience and detailed content search and suggest a mixed interface design providing both options to the users. Supplemental materials about the user study and used data are available at https://osf.io/5vk49/.

  • Conference Article
  • Cited by 27
  • 10.1109/iccv.2017.360
Beyond Standard Benchmarks: Parameterizing Performance Evaluation in Visual Object Tracking
  • Oct 1, 2017
  • Luka Cehovin Zajc + 3 more

Object-to-camera motion produces a variety of apparent motion patterns that significantly affect performance of short-term visual trackers. Despite being crucial for designing robust trackers, their influence is poorly explored in standard benchmarks due to weakly defined, biased and overlapping attribute annotations. In this paper we propose to go beyond pre-recorded benchmarks with post-hoc annotations by presenting an approach that utilizes omnidirectional videos to generate realistic, consistently annotated, short-term tracking scenarios with exactly parameterized motion patterns. We have created an evaluation system, constructed a fully annotated dataset of omnidirectional videos and generators for typical motion patterns. We provide an in-depth analysis of major tracking paradigms which is complementary to the standard benchmarks and confirms the expressiveness of our evaluation approach.

  • Conference Article
  • Cited by 228
  • 10.1145/3083187.3084016
Towards Bandwidth Efficient Adaptive Streaming of Omnidirectional Video over HTTP
  • Jun 20, 2017
  • Mario Graf + 2 more

Real-time entertainment services, such as streaming audiovisual content over the open, unmanaged Internet, now account for more than 70% of traffic during peak periods. More and more such bandwidth-hungry applications and services are being proposed, including immersive media services such as virtual reality and, specifically, omnidirectional/360-degree videos. The adaptive streaming of omnidirectional video over HTTP imposes an important challenge on today's video delivery infrastructures, which calls for dedicated, thoroughly designed techniques for content generation, delivery, and consumption. This paper describes the usage of tiles (as specified within modern video codecs such as HEVC/H.265 and VP9) to enable bandwidth-efficient adaptive streaming of omnidirectional video over HTTP, and we define various streaming strategies. The parameters and characteristics of a dataset for omnidirectional video are proposed and exemplarily instantiated to evaluate various aspects of such an ecosystem, namely bitrate overhead, bandwidth requirements, and quality in terms of viewport PSNR. The results indicate bitrate savings from 40% (in a realistic scenario with recorded head movements from real users) up to 65% (in an ideal scenario with a centered/fixed viewport) and serve as a baseline and guidelines for advanced techniques, including the outline of a research roadmap for the near future.
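
A back-of-the-envelope calculation shows why viewport-adaptive tiling yields savings in this range; the tile counts and per-tile rates below are invented purely for illustration.

```python
# Toy arithmetic: stream only viewport tiles at the high rate and the
# rest at a low rate, then compare against sending everything at the
# high rate. Numbers are assumptions, not the paper's measurements.
TILES, VIEWPORT_TILES = 24, 6
HIGH, LOW = 2.0, 0.4        # Mbit/s per tile (assumed)

uniform = TILES * HIGH
tiled = VIEWPORT_TILES * HIGH + (TILES - VIEWPORT_TILES) * LOW
print(f"saving: {100 * (1 - tiled / uniform):.0f}%")  # 60% in this toy case
```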

  • Conference Article
  • Cited by 22
  • 10.1109/iccvw.2019.00446
The Vid3oC and IntVID Datasets for Video Super Resolution and Quality Mapping
  • Oct 1, 2019
  • Sohyeong Kim + 6 more

The current rapid advancement of computational hardware has opened the door for deep networks to be applied to real-time video processing, even on consumer devices. Appealing tasks include video super-resolution, compression artifact removal, and quality enhancement. These problems require high-quality datasets that can be applied for training and benchmarking. In this work, we therefore introduce two video datasets aimed at a variety of tasks. First, we propose the Vid3oC dataset, containing 82 simultaneous recordings from 3 camera sensors. It is recorded with a multi-camera rig, including a high-quality DSLR camera, a high-end smartphone, and a stereo camera sensor. Second, we introduce the IntVID dataset, containing over 150 high-quality videos crawled from the internet. The datasets were employed in the AIM 2019 challenges for video super-resolution and quality mapping.

  • Conference Article
  • Cited by 21
  • 10.1109/wacv45572.2020.9093283
Weakly-Supervised Multi-Person Action Recognition in 360° Videos
  • Mar 1, 2020
  • Junnan Li + 4 more

The recent development of commodity 360° cameras has enabled a single video to capture an entire scene, which holds promising potential for surveillance scenarios. However, research in omnidirectional video analysis has lagged behind the hardware advances. In this work, we address the important problem of action recognition in top-view 360° videos. Due to the wide field-of-view, 360° videos usually capture multiple people performing actions at the same time. Furthermore, the appearance of people is deformed. The proposed framework first transforms top-view omnidirectional videos into panoramic videos using a calibration-free method. Then spatial-temporal features are extracted using region-based 3D CNNs for action recognition. We propose a weakly-supervised method based on multi-instance multi-label learning, which trains the model to recognize and localize multiple actions in a video using only video-level action labels as supervision. We perform experiments to quantitatively validate the efficacy of the proposed method over state-of-the-art baselines and variants of our model, and qualitatively demonstrate action localization results. To enable research in this direction, we introduce the 360Action dataset. It is the first omnidirectional video dataset for multi-person action recognition with a diverse set of scenes, actors, and actions. The dataset is available at https://github.com/ryukenzen/360action.
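
The top-view-to-panorama step can be approximated with a simple polar-to-Cartesian remap, sketched below with OpenCV; the optical center and image-circle radius are assumptions, and the paper's calibration-free method may differ in detail.

```python
# Hedged sketch of unwarping a top-view omnidirectional frame into a
# panoramic strip that ordinary region-based CNNs can process. Center
# and radius are assumed, not estimated as in the paper.
import cv2

frame = cv2.imread("topview_omni.png")   # assumed input frame
h, w = frame.shape[:2]
center = (w / 2, h / 2)                  # assumed optical center
radius = min(center)                     # assumed image-circle radius

polar = cv2.warpPolar(frame, (int(radius), 360), center, radius,
                      cv2.WARP_POLAR_LINEAR)  # rows=angle, cols=radius
panorama = cv2.rotate(polar, cv2.ROTATE_90_COUNTERCLOCKWISE)
cv2.imwrite("panorama.png", panorama)
```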

  • Conference Article
  • Cited by 37
  • 10.1109/cvprw53098.2021.00075
NTIRE 2021 Challenge on Quality Enhancement of Compressed Video: Methods and Results
  • Jun 1, 2021
  • Ren Yang + 71 more

This paper reviews the first NTIRE challenge on quality enhancement of compressed video, with a focus on the proposed solutions and results. In this challenge, the new Large-scale Diverse Video (LDV) dataset is employed. The challenge has three tracks. Tracks 1 and 2 aim at enhancing videos compressed by HEVC at a fixed QP, while Track 3 is designed for enhancing videos compressed by x265 at a fixed bitrate. Besides, the quality enhancement of Tracks 1 and 3 targets improving fidelity (PSNR), while Track 2 targets enhancing perceptual quality. The three tracks attracted a total of 482 registrations. In the test phase, 12 teams, 8 teams, and 11 teams submitted final results for Tracks 1, 2, and 3, respectively. The proposed methods and solutions gauge the state-of-the-art of video quality enhancement. The homepage of the challenge: https://github.com/RenYang-home/NTIRE21_VEnh

  • Conference Article
  • Cited by 29
  • 10.1109/cvprw53098.2021.00076
NTIRE 2021 Challenge on Quality Enhancement of Compressed Video: Dataset and Study
  • Jun 1, 2021
  • Ren Yang + 1 more

This paper introduces a novel dataset for video enhancement and studies the state-of-the-art methods of the NTIRE 2021 challenge on quality enhancement of compressed video. The challenge is the first NTIRE challenge in this direction, with three competitions, hundreds of participants and tens of proposed solutions. Our newly collected Large-scale Diverse Video (LDV) dataset is employed in the challenge. In our study, we analyze the solutions of the challenges and several representative methods from previous literature on the proposed LDV dataset. We find that the NTIRE 2021 challenge advances the state-of-the-art of quality enhancement on compressed video. The proposed LDV dataset is publicly available at the homepage of the challenge: https://github.com/RenYang-home/NTIRE21_VEnh

  • Conference Article
  • Cited by 8
  • 10.1109/robot.2005.1570757
Learning to Track Multiple People in Omnidirectional Video
  • Apr 18, 2005
  • F De La Torre + 4 more

Meetings are a very important part of everyday life for professionals working in universities, companies, or governmental institutions. We have designed a physical awareness system called CAMEO (Camera Assisted Meeting Event Observer), a hardware/software system to record and monitor people’s activities in meetings. CAMEO captures a high-resolution omnidirectional view of the meeting by stitching images coming from almost-concentric cameras. Besides its recording capability, CAMEO automatically detects people and learns a person-specific facial appearance model (PSFAM) for each of the participants. The PSFAMs allow more robust and reliable tracking and identification. In this paper, we describe the video-capturing device, the photometric/geometric autocalibration process, and the multiple-people tracking system. The effectiveness and robustness of the proposed system are demonstrated over several real-time experiments and a large dataset of videos.
