Aerial View River Landform Video Segmentation: A Weakly Supervised Context-Aware Temporal Consistency Distillation Approach

Abstract

The study of terrain and landform classification through UAV remote sensing differs significantly from ground-vehicle patrol tasks. Beyond the complexity of data annotation and the need to ensure temporal consistency, it also faces a scarcity of relevant data and the limited effective range of many existing techniques. This research shows that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. We demonstrate that fully labeling the data is not the optimal choice, while training on selected key data alone fails to improve TC. Hence, a teacher-student architecture, coupled with key-frame selection and key-frame updating algorithms, is proposed. This framework performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. Experimental results reveal that our method, using merely 30% of the labeled data, raises both mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection
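
The core of the proposed framework is a teacher-student scheme that combines sparse key-frame supervision with temporal-consistency distillation. As a rough illustration of that idea (not the paper's published procedure; the function names and loss weighting below are assumptions), one PyTorch-style training step might look like:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, optimizer, frames, key_idx, key_label):
    """One step of a generic teacher-student scheme in the spirit of the
    abstract: the student is supervised on the selected key frame and
    distilled toward the teacher's (temporally stable) predictions on the
    remaining unlabeled frames. Illustrative sketch, not the paper's method."""
    # Supervised loss on the single annotated key frame.
    sup = F.cross_entropy(student(frames[key_idx]), key_label)

    # Distillation loss against frozen teacher predictions on all frames.
    with torch.no_grad():
        teacher_probs = [teacher(f).softmax(dim=1) for f in frames]
    distill = 0.0
    for f, t_prob in zip(frames, teacher_probs):
        distill = distill + F.kl_div(
            student(f).log_softmax(dim=1), t_prob, reduction="batchmean")

    loss = sup + 0.5 * distill / len(frames)  # weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```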

Similar Papers
  • Conference Article
  • Cited by 9
  • 10.1145/3394171.3413788
Temporal Denoising Mask Synthesis Network for Learning Blind Video Temporal Consistency
  • Oct 12, 2020
  • Yifeng Zhou + 5 more

Recently, developing temporally consistent video-based processing techniques has drawn increasing attention due to the limited extensibility of existing image-based processing algorithms (e.g., filtering, enhancement, colorization). Applying these image-based algorithms independently to each video frame typically leads to temporal flickering because of their global instability. In this paper, we treat enforcing temporal consistency in a video as a temporal denoising problem: removing the flickering effect from given unstable pre-processed frames. Specifically, we propose a novel model termed Temporal Denoising Mask Synthesis Network (TDMS-Net) that jointly predicts the motion mask, soft optical flow, and refining mask to synthesize temporally consistent frames. The temporal consistency is learned from the original video, and the learned temporal features are applied to reprocess output frames in a manner agnostic (blind) to the specific image-based processing algorithm. Experimental results on two datasets across 16 different applications demonstrate that the proposed TDMS-Net significantly outperforms two state-of-the-art blind temporal consistency approaches.
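
As a rough illustration of the mask-blending idea behind TDMS-Net (not the authors' network; the function and mask semantics here are assumptions), a stabilized frame can be a per-pixel convex combination of the warped previous output and the current unstable frame:

```python
import numpy as np

def synthesize_stable_frame(cur_processed, prev_stable_warped, mask):
    """Blend the flow-warped previous stabilized frame with the current
    per-frame processed output using a per-pixel mask in [0, 1].
    mask -> 1 keeps the temporal estimate; mask -> 0 trusts the new frame.
    A minimal sketch of the blending idea, not TDMS-Net itself."""
    mask = mask[..., None]  # broadcast the (H, W) mask over color channels
    return mask * prev_stable_warped + (1.0 - mask) * cur_processed
```

In the paper the mask and flow are themselves predicted by the network; here they are simply taken as inputs.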

  • Research Article
  • 10.1145/3763348
Lightweight, Edge-Aware, and Temporally Consistent Supersampling for Mobile Real-Time Rendering
  • Dec 1, 2025
  • ACM Transactions on Graphics
  • Sipeng Yang + 9 more

Supersampling has proven highly effective in enhancing visual fidelity by reducing aliasing, increasing resolution, and generating interpolated frames. It has become a standard component of modern real-time rendering pipelines. However, on mobile platforms, deep learning-based supersampling methods remain impractical due to stringent hardware constraints, while non-neural supersampling techniques often fall short in delivering perceptually high-quality results. In particular, producing visually pleasing reconstructions and temporally coherent interpolations is still a significant challenge in mobile settings. In this work, we present a novel, lightweight supersampling framework tailored for mobile devices. Our approach substantially improves both image reconstruction quality and temporal consistency while maintaining real-time performance. For super-resolution, we propose an intra-pixel object coverage estimation method for reconstructing high-quality anti-aliased pixels in edge regions, a gradient-guided strategy for non-edge areas, and a temporal sample accumulation approach to improve overall image quality. For frame interpolation, we develop an efficient motion estimation module coupled with a lightweight fusion scheme that integrates both estimated optical flow and rendered motion vectors, enabling temporally coherent interpolation of object dynamics and lighting variations. Extensive experiments demonstrate that our method consistently outperforms existing baselines in both perceptual image quality and temporal smoothness, while maintaining real-time performance on mobile GPUs. A demo application and supplementary materials are available on the project page.
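
One component the abstract names is temporal sample accumulation. A standard formulation of that idea, sketched below as an exponential moving average with neighborhood clamping (common in temporal supersampling pipelines, but not necessarily the paper's exact scheme):

```python
import numpy as np

def accumulate_sample(cur, history_reprojected, alpha=0.1, clamp_eps=0.05):
    """Exponential moving average of a reprojected history buffer with
    simple per-pixel clamping to reject stale history, as used in many
    temporal AA/supersampling pipelines. Illustrative sketch only;
    alpha and clamp_eps are assumptions."""
    lo, hi = cur - clamp_eps, cur + clamp_eps       # crude color bounds
    history = np.clip(history_reprojected, lo, hi)  # reject disoccluded history
    return alpha * cur + (1.0 - alpha) * history
```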

  • Research Article
  • Cited by 11
  • 10.1609/aaai.v36i2.20032
Hybrid Instance-Aware Temporal Fusion for Online Video Instance Segmentation
  • Jun 28, 2022
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Xiang Li + 3 more

Recently, transformer-based image segmentation methods have achieved notable success over previous solutions, while for video domains, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverage a latent code in the global context (the instance code) together with CNN feature maps to represent instance- and pixel-level features, respectively. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes is further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching of prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e., YouTube-VIS-19/21. Our model achieves the best performance among all online VIS methods and notably eclipses all offline methods when using the ResNet-50 backbone.
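
The fusion rests on attention between per-instance latent codes and pixel-level CNN features. A minimal sketch of such a cross-attention update (illustrative only; the authors' hybrid attention is more elaborate):

```python
import torch
import torch.nn.functional as F

def instance_to_pixel_attention(inst_codes, feat_map):
    """Cross-attention between per-instance latent codes and a CNN
    feature map -- a minimal sketch of the 'instance code <-> pixel
    feature' fusion idea, not the authors' exact architecture.

    inst_codes: (N, C) one latent code per instance
    feat_map:   (C, H, W) pixel-level features
    returns:    (N, C) instance codes refined with pixel context
    """
    c, h, w = feat_map.shape
    pixels = feat_map.view(c, h * w).t()                           # (HW, C)
    attn = F.softmax(inst_codes @ pixels.t() / c ** 0.5, dim=-1)   # (N, HW)
    return inst_codes + attn @ pixels                              # residual update
```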

  • Conference Article
  • Cited by 2
  • 10.1109/icra46639.2022.9812382
Stable 3D Human Pose Estimation in Low-Resolution Videos with a Few Views
  • May 23, 2022
  • Chihiro Nakatsuka + 1 more

We discuss the problem of 3D pose estimation for multi-view videos. With previous frame-by-frame multi-view methods, it has been difficult to achieve stable estimation under challenging settings such as low resolution or only a few views. Temporal approaches are effective ways of addressing such problems, but enforcing temporal consistency with neighboring frames sometimes damages the precision of the results. We propose a temporal approach with selective corrections, based on the observation that errors in the frame-by-frame approach are concentrated under certain adverse conditions. Our method evaluates the confidence of the frame-by-frame results and compensates for the inaccurate keypoints with temporal information while retaining the accurate keypoints. In our experiments on the CMU Panoptic dataset customized for low resolution and a few views, we report 32.98 mm for MPJPE and 98.64% for 3D-PCK@150. Compared to the state-of-the-art method, our method improves MPJPE by 1.14 mm and corrects 16% of incorrect keypoints.
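
The selective-correction idea, keeping confident frame-by-frame keypoints and repairing only the unreliable ones from temporal context, can be sketched as follows (linear interpolation stands in for the paper's temporal compensation, and the threshold is an assumption):

```python
import numpy as np

def correct_keypoints(frames_kp, conf, thresh=0.5):
    """Replace low-confidence 3D keypoints with a temporal estimate
    (here: linear interpolation from reliable neighboring frames),
    keeping high-confidence frame-by-frame results untouched.
    Illustrative of selective correction, not the authors' method.

    frames_kp: (T, J, 3) per-frame keypoints; conf: (T, J) confidences."""
    out = frames_kp.copy()
    t_len, j_len, _ = frames_kp.shape
    for j in range(j_len):
        good = conf[:, j] >= thresh
        if good.sum() < 2:
            continue  # not enough reliable anchors to interpolate from
        t_good = np.flatnonzero(good)
        t_bad = np.flatnonzero(~good)
        for axis in range(3):
            out[t_bad, j, axis] = np.interp(
                t_bad, t_good, frames_kp[t_good, j, axis])
    return out
```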

  • Book Chapter
  • Cited by 29
  • 10.1007/978-3-319-62398-6_48
Automatic Detection of a Driver’s Complex Mental States
  • Jan 1, 2017
  • Zhiyi Ma + 4 more

Automatic classification of drivers' mental states is an important yet relatively unexplored topic. In this paper, we define a taxonomy of complex mental states relevant to driving, namely: Happy, Bothered, Concentrated, and Confused. We present our video segmentation and annotation methodology for a spontaneous dataset of natural driving videos from 10 different drivers. We also present the real-time annotation tool used to label the dataset via an emotion perception experiment and discuss the challenges faced in obtaining the ground-truth labels. Finally, we present a methodology for automatic classification of drivers' mental states. We compare SVM models trained on our dataset with an existing nearest-neighbour model pre-trained on a posed dataset, using facial Action Units as input features, and demonstrate that our temporal SVM approach yields better results. The dataset's extracted features and validated emotion labels, together with the annotation tool, will be made available to the research community.
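
The classification stage is an SVM over facial Action Unit features. A minimal scikit-learn sketch of that setup (feature dimensionality, kernel, and data here are placeholders, not the paper's configuration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical setup: each sample is a vector of facial Action Unit
# intensities over a short window; labels are the four driver states.
STATES = ["Happy", "Bothered", "Concentrated", "Confused"]
X = np.random.rand(200, 17)        # placeholder AU feature vectors
y = np.random.randint(0, 4, 200)   # placeholder state labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print(STATES[clf.predict(X[:1])[0]])  # predicted state for one sample
```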

  • Conference Article
  • Cited by 9
  • 10.1109/icip.2016.7533100
Multi-view semantic temporal video segmentation
  • Sep 1, 2016
  • Thomas Theodoridis + 2 more

In this work, we propose a multi-view temporal video segmentation approach that employs a Gaussian scoring process for determining the best segmentation positions. By exploiting the semantic action information that the dense trajectories video description offers, this method can detect intra-shot actions as well, unlike shot boundary detection approaches. We compare the temporal segmentation results of the proposed method to both single-view and multi-view methods, and also compare the action recognition results obtained on ground truth video segments to the ones obtained on the proposed multi-view segments, on the IMPART multi-view action data set.
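
A generic sketch of Gaussian boundary scoring, under the assumption that each frame carries a descriptor (e.g., aggregated dense-trajectory features) and that candidate boundaries are scored by the Gaussian-weighted contrast of the two sides; this is an illustration of the idea, not the paper's exact formulation:

```python
import numpy as np

def boundary_scores(desc, sigma=8.0):
    """Score each frame as a temporal segmentation boundary by comparing
    Gaussian-weighted averages of per-frame descriptors before and after
    it. High scores mark likely segment boundaries.
    desc: (T, D) per-frame feature vectors; sigma is an assumption."""
    t_len = desc.shape[0]
    ts = np.arange(t_len)
    scores = np.zeros(t_len)
    for t in range(1, t_len - 1):
        w = np.exp(-0.5 * ((ts - t) / sigma) ** 2)  # Gaussian window at t
        left = np.average(desc[:t], axis=0, weights=w[:t])
        right = np.average(desc[t:], axis=0, weights=w[t:])
        scores[t] = np.linalg.norm(left - right)    # contrast across t
    return scores
```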

  • Conference Article
  • Cited by 3
  • 10.1109/icme.2001.1237937
A novel block-based video segmentation algorithm
  • Jan 1, 2001
  • L Atzori + 2 more

This paper presents a new technique for video segmentation and tracking. Like most segmentation techniques, it consists of an initial model-generation process followed by an object-tracking phase. The model generation is accomplished by combining temporal and spatial transition-detection approaches. The novelty of the method is that these approaches are performed block by block, which reduces object-connectivity problems and drastically decreases the algorithm's computational complexity with respect to pixel-by-pixel processing. According to the proposed strategy, edged blocks are first extracted in an active region selected by the user. From these, the subset of blocks representing the object contour is selected by minimizing a cost function over a per-block feature vector: motion, smoothness, continuity, edge strength, and position. The tracking task is then performed by estimating the motion of the model blocks. Experiments show accuracy comparable to existing segmentation techniques at reduced computational complexity.
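
The contour-selection step minimizes a cost over the listed per-block features. As a loudly hypothetical sketch (the abstract names the features but not the functional form), a weighted sum is the simplest instance:

```python
def block_cost(block, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted cost over the per-block feature vector the abstract
    lists: motion, smoothness, continuity, edge strength, and position.
    The functional form and weights here are placeholder assumptions;
    lower cost = better contour candidate."""
    feats = (block["motion"], block["smoothness"], block["continuity"],
             block["strength"], block["position"])
    return sum(w * f for w, f in zip(weights, feats))
```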

  • Conference Article
  • Cited by 30
  • 10.1109/wacv.2015.145
Real-Time Facial Expression Recognition on Smartphones
  • Jan 1, 2015
  • Myunghoon Suk + 1 more

Temporal segmentation of real-time video is an important part of an automatic facial expression recognition system. Many studies of facial expression recognition have been carried out in restricted experimental environments, such as on pre-segmented video sets. In this paper, we present a real-time temporal video segmentation approach for automatic facial expression recognition applicable on a smartphone. The proposed system uses a Finite State Machine (FSM) to segment real-time video into temporal phases from the neutral expression to the peak of an expression. The FSM uses scores based on Lucas-Kanade optical-flow vectors for state transitions, adapting to the varying speeds of facial expressions. While HMM-based or hybrid HMM-based approaches to time-series data require sampling times, the proposed system runs without any sampling-time delay. The system performs facial expression recognition with Support Vector Machines (SVM) at every apex state after automatic temporal segmentation. The mobile app runs on a Samsung Galaxy S3 at 3.7 fps, and the accuracy of real-time mobile emotion recognition is about 70.6% for 6 basic emotions across 5 subjects who are not professional actors.
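
A minimal sketch of such an FSM, with an optical-flow-magnitude score driving neutral → onset → apex transitions (state names and thresholds here are illustrative, not the paper's):

```python
# Expression phases segmented from a live stream, in the spirit of the
# paper's FSM approach; thresholds are placeholder assumptions.
NEUTRAL, ONSET, APEX = range(3)

def step(state, flow_score, rise=0.3, settle=0.05):
    """Advance the expression FSM for one frame, given an optical-flow
    magnitude score for facial motion in that frame."""
    if state == NEUTRAL and flow_score > rise:
        return ONSET      # motion picked up: expression starting
    if state == ONSET and flow_score < settle:
        return APEX       # motion settled: expression at its peak
    if state == APEX and flow_score > rise:
        return NEUTRAL    # motion resumed: face relaxing back to neutral
    return state
```

Recognition (the SVM stage) would then run only on frames reaching the APEX state.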

  • Research Article
  • Cited by 2
  • 10.1016/j.cmpb.2025.108782
A temporal convolutional network-based approach and a benchmark dataset for colonoscopy video temporal segmentation.
  • Oct 1, 2025
  • Computer methods and programs in biomedicine
  • Carlo Biffi + 3 more


  • Research Article
  • 10.3389/fphys.2025.1629121
Echo-ODE: A dynamics modeling network with neural ODE for temporally consistent segmentation of video echocardiograms
  • Aug 18, 2025
  • Frontiers in Physiology
  • Wenliang Lu + 5 more

Introduction: Segmentation of echocardiograms plays a crucial role in clinical diagnosis. Beyond accuracy, a major challenge of video echocardiogram analysis is the temporal consistency of consecutive frames; stable and consistent segmentation of cardiac structures is essential for reliable fully automatic echocardiogram interpretation. Methods: We propose a novel framework, Echo-ODE, in which the heart is regarded as a dynamical system and its dynamics are modeled with neural ordinary differential equations. Echo-ODE learns the spatio-temporal relationships of the input video and outputs continuous, consistent predictions. Results: Experiments conducted on the Echo-Dynamic, the CAMUS, and our private dataset demonstrate that Echo-ODE achieves comparable accuracy but significantly better temporal stability and consistency in video segmentation than previous mainstream CNN models. More accurate phase detection and robustness to arrhythmia further underscore the superiority of the proposed model. Discussion: Echo-ODE addresses the critical need for temporal coherence in clinical video analysis and establishes a versatile backbone extendable beyond segmentation tasks. Its ability to model cardiac dynamics demonstrates great potential for enabling reliable, fully automated video echocardiogram interpretation. The code is publicly available at https://github.com/luwenlianglu/EchoODE.
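
A minimal sketch of the neural-ODE idea, assuming the torchdiffeq package and omitting the encoder/decoder (this is not Echo-ODE itself): a learned vector field evolves a latent cardiac state continuously across frame timestamps, which is what yields temporally smooth per-frame predictions.

```python
import torch
from torch import nn
from torchdiffeq import odeint  # assumes the torchdiffeq package is installed

class Dynamics(nn.Module):
    """Learned vector field f(t, z) over a latent cardiac state z."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(),
                                 nn.Linear(128, dim))

    def forward(self, t, z):
        return self.net(z)

# Evolve the encoded first-frame latent across normalized frame times;
# each latent would then be decoded into a segmentation mask.
z0 = torch.randn(1, 64)                  # latent of the first frame (placeholder)
t = torch.linspace(0.0, 1.0, steps=32)   # normalized frame timestamps
latents = odeint(Dynamics(), z0, t)      # (32, 1, 64) continuous trajectory
```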

  • Book Chapter
  • Cited by 6
  • 10.1007/978-3-319-49409-8_65
Point-Wise Mutual Information-Based Video Segmentation with High Temporal Consistency
  • Jan 1, 2016
  • Margret Keuper + 1 more

In this paper, we tackle the problem of temporally consistent boundary detection and hierarchical segmentation in videos. While finding the best high-level reasoning of region assignments in videos is the focus of much recent research, temporal consistency in boundary detection has so far only rarely been tackled. We argue that temporally consistent boundaries are a key component to temporally consistent region assignment. The proposed method is based on the point-wise mutual information (PMI) of spatio-temporal voxels. Temporal consistency is established by an evaluation of PMI-based point affinities in the spectral domain over space and time. Thus, the proposed method is independent of any optical flow computation or previously learned motion models. The proposed low-level video segmentation method outperforms the learning-based state of the art in terms of standard region metrics.
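
The affinity at the heart of the method is standard point-wise mutual information; the sketch below shows the base definition (the paper evaluates such affinities spectrally over space and time, which is omitted here):

```python
import numpy as np

def pmi_affinity(p_joint, p_a, p_b):
    """Point-wise mutual information between two voxel feature events:
    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ). Positive values mean the
    features co-occur more often than chance, so the voxels likely belong
    to the same region. Minimal sketch of the affinity the paper builds on."""
    return np.log(p_joint / (p_a * p_b))
```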

  • Conference Article
  • Cited by 18
  • 10.1109/wacv51458.2022.00268
Perceptual Consistency in Video Segmentation
  • Jan 1, 2022
  • Yizhe Zhang + 6 more

In this paper, we present a novel perceptual consistency perspective on video semantic segmentation, which can capture both temporal consistency and pixel-wise correctness. Given two nearby video frames, perceptual consistency measures how much the segmentation decisions agree with the pixel correspondences obtained via matching general perceptual features. More specifically, for each pixel in one frame, we find the most perceptually correlated pixel in the other frame. Our intuition is that such a pair of pixels are highly likely to belong to the same class. Next, we assess how much the segmentation agrees with such perceptual correspondences, based on which we derive the perceptual consistency of the segmentation maps across these two frames. Utilizing perceptual consistency, we can evaluate the temporal consistency of video segmentation by measuring the perceptual consistency over consecutive pairs of segmentation maps in a video. Furthermore, given a sparsely labeled test video, perceptual consistency can be utilized to aid with predicting the pixel-wise correctness of the segmentation on an unlabeled frame. More specifically, by measuring the perceptual consistency between the predicted segmentation and the available ground truth on a nearby frame and combining it with the segmentation confidence, we can accurately assess the classification correctness on each pixel. Our experiments show that the proposed perceptual consistency can more accurately evaluate the temporal consistency of video segmentation as compared to flow-based measures. Furthermore, it can help more confidently predict segmentation accuracy on unlabeled test frames, as compared to using classification confidence alone. Finally, our proposed measure can be used as a regularizer during the training of segmentation models, which leads to more temporally consistent video segmentation while maintaining accuracy.
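
A simplified sketch of the measure: match each pixel of one frame to its most perceptually similar pixel in the other frame via feature correlation, then score how often the two segmentation maps agree on matched pairs (brute-force matching here; the paper's feature choice and details differ):

```python
import torch
import torch.nn.functional as F

def perceptual_consistency(feat_a, feat_b, seg_a, seg_b):
    """For each pixel in frame A, find its most similar pixel in frame B
    by normalized feature correlation, then measure how often the two
    segmentation maps assign that pair the same class.
    feat_*: (C, H, W) perceptual features; seg_*: (H, W) label maps."""
    c, h, w = feat_a.shape
    fa = F.normalize(feat_a.view(c, -1), dim=0)  # (C, HW), unit columns
    fb = F.normalize(feat_b.view(c, -1), dim=0)
    # Brute-force HW x HW similarity: fine for a sketch, not full-res frames.
    match = (fa.t() @ fb).argmax(dim=1)          # best B pixel per A pixel
    return (seg_a.view(-1) == seg_b.view(-1)[match]).float().mean()
```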

  • Conference Article
  • Cited by 31
  • 10.1109/wacv51458.2022.00269
AuxAdapt: Stable and Efficient Test-Time Adaptation for Temporally Consistent Video Semantic Segmentation
  • Jan 1, 2022
  • Yizhe Zhang + 3 more

In video segmentation, generating temporally consistent results across frames is as important as achieving frame-wise accuracy. This paper presents an efficient, intuitive, and unsupervised online adaptation method, AuxAdapt, for improving the temporal consistency of most neural network models. It does not require optical flow and takes only one pass over the video. Since inconsistency mainly arises from the model's uncertainty in its output, we propose an adaptation scheme where the model learns from its own segmentation decisions as it streams a video, which allows producing more confident and temporally consistent labeling for similar-looking pixels across frames. For stability and efficiency, we leverage a small auxiliary segmentation network (AuxNet) to assist with this adaptation. More specifically, AuxNet readjusts the decision of the original segmentation network (MainNet) by adding its own estimations to those of MainNet. At every frame, only AuxNet is updated via back-propagation while MainNet is kept fixed. We extensively evaluate our test-time adaptation approach on standard video benchmarks, including Cityscapes, CamVid, and KITTI. The results demonstrate that our approach provides label-wise accurate, temporally consistent, and computationally efficient adaptation.
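
A sketch of one online step in the spirit of AuxAdapt (the fusion, pseudo-labeling, and loss below follow the abstract's description, but this is not the authors' exact training loop):

```python
import torch
import torch.nn.functional as F

def auxadapt_step(main_net, aux_net, optimizer, frame):
    """One online adaptation step: MainNet is frozen, the fused prediction
    supplies pseudo-labels, and only the small AuxNet is updated. Both nets
    are assumed to map an image batch to per-class logits (N, C, H, W)."""
    with torch.no_grad():
        main_logits = main_net(frame)       # frozen MainNet prediction
    aux_logits = aux_net(frame)
    fused = main_logits + aux_logits        # AuxNet readjusts MainNet
    pseudo = fused.argmax(dim=1).detach()   # model's own decision
    loss = F.cross_entropy(fused, pseudo)   # sharpen toward that decision
    optimizer.zero_grad()                   # optimizer covers AuxNet only
    loss.backward()
    optimizer.step()
    return fused.argmax(dim=1)              # segmentation for this frame
```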

  • Conference Article
  • Cited by 37
  • 10.1109/cvprw50498.2020.00176
Unsupervised Temporal Consistency Metric for Video Segmentation in Highly-Automated Driving
  • Jun 1, 2020
  • Serin Varghese + 9 more

Commonly used metrics to evaluate semantic segmentation such as mean intersection over union (mIoU) do not incorporate temporal consistency. A straightforward extension of existing metrics towards evaluating the consistency of segmentation of video sequences does not exist, since labelled videos are rare and very expensive to obtain. For safety-critical applications such as highly automated driving, there is, however, a need for a metric that measures such temporal consistency of video segmentation networks to possibly support safety requirements. In this paper, (a) we introduce a metric which does not require segmentation labels for measuring the stability of the predictions of segmentation networks over a series of images; (b) we perform an in-depth analysis of the proposed metric and observe strong correlations to the supervised mIoU metric; (c) we perform an evaluation of five state-of-the-art networks for semantic segmentation of varying complexities and architectures evaluated on two public datasets, namely, Cityscapes and CamVid. Finally, we perform timing evaluations and propose the use of the metric as either an online observer for identification of possibly unstable segmentation predictions, or as an offline method to evaluate or to improve semantic segmentation networks, e.g., by selecting additional training data with critical temporal consistency.
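
A common label-free instantiation of such a metric warps the previous frame's prediction onto the current frame with optical flow and scores their agreement with mIoU. A numpy sketch under that assumption (nearest-neighbor warping; the paper's metric may differ in detail):

```python
import numpy as np

def warp_labels(prev_pred, flow):
    """Warp the previous frame's label map to the current frame using a
    dense (H, W, 2) optical-flow field, via nearest-neighbor lookup."""
    h, w = prev_pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_pred[src_y, src_x]

def temporal_consistency(prev_pred, cur_pred, flow, num_classes):
    """Mean IoU between the flow-warped previous prediction and the
    current prediction -- a label-free temporal-consistency proxy."""
    warped = warp_labels(prev_pred, flow)
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(warped == c, cur_pred == c).sum()
        union = np.logical_or(warped == c, cur_pred == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 1.0
```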

  • Conference Article
  • Cited by 4
  • 10.1109/cerma.2006.10
A Relaxed Temporal Consistency Approach for Real-Time Concurrency Control
  • Sep 1, 2006
  • Alejandro Ibarra + 1 more

This work presents a real-time distributed concurrency-control scheme in which consistency is relaxed to let read tasks run alongside concurrent write tasks. The algorithm reduces the likelihood of missed deadlines by increasing the level of concurrency in systems where read operations outnumber write operations. The proposed algorithm is a variant of an existing concurrency-control algorithm and is faster in the sense that it offers lower response times.
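
As a loudly generic illustration of relaxing consistency so that reads never block (an MVCC-style register, not the paper's algorithm):

```python
import threading

class RelaxedRegister:
    """Toy illustration of relaxed consistency for reads: readers always
    return the last committed value without blocking, even while a writer
    is active, so read tasks never wait on write tasks. A generic sketch,
    not the paper's concurrency-control scheme."""
    def __init__(self, value):
        self._committed = value
        self._write_lock = threading.Lock()  # writers still serialize

    def read(self):
        return self._committed               # non-blocking, possibly stale

    def write(self, new_value):
        with self._write_lock:
            self._committed = new_value      # atomic reference swap
```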
