NaviFormer: Multimodal scene segmentation for assistive navigation

Similar Papers
  • Conference Article
  • Citations: 21
  • 10.1145/1631272.1631383
Multi-modal scene segmentation using scene transition graphs
  • Oct 19, 2009
  • Panagiotis Sidiropoulos + 4 more

In this work the problem of automatic decomposition of video into elementary semantic units, known in the literature as scenes, is addressed. Two multi-modal automatic scene segmentation techniques are proposed, both building upon the Scene Transition Graph (STG). In the first of the proposed approaches, speaker diarization results are used for introducing a post-processing step to the STG construction algorithm, with the objective of discarding scene boundaries erroneously identified according to visual-only dissimilarity. In the second approach, speaker diarization and additional audio analysis results are employed and a separate audio-based STG is constructed, in parallel to the original STG based on visual information. The two STGs are subsequently combined. Preliminary results from the application of the proposed techniques to broadcast videos reveal their improved performance over previous approaches.

  • Conference Article
  • Citations: 1
  • 10.1109/ichci54629.2021.00019
A Video Scene Segmentation Optimization Algorithm Based on Convolutional Neural Network
  • Nov 1, 2021
  • Qing Huang + 2 more

Video scene segmentation is an important part of realizing content-based video retrieval (CBVR). To address the low efficiency of video scene segmentation in CBVR, this paper proposes a multi-modal video scene segmentation optimization algorithm based on feature extraction with a convolutional neural network (CNN). Because the multi-modal data of a video contain a large amount of information, the VGG19 network is improved in a targeted manner, and low-level and semantic features of the various modalities are extracted from each video shot. By forming these features into vectors and using methods such as triplet-loss learning and shot similarity calculation, the scene segmentation task is converted into a binary classification problem over shot boundaries. A scoring mechanism is then established to optimize the results and complete the scene segmentation. Experimental results show that the algorithm is effective for video scene segmentation, with overall recall and precision reaching 85.77% and 87.01%, respectively. Compared with the shot similarity graph method, the two indicators increase by 10% and 9%, respectively. Compared with the DeepSSS method, which also uses a deep learning network model, the comprehensive F-measure metric increases by 8%.
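
The abstract above reports recall and precision but not the resulting F-measure itself. As a minimal sketch, assuming the paper uses the standard balanced F-measure (F1, the harmonic mean of precision and recall), the two quoted figures can be combined as follows. The function name `f_measure` and the `beta` parameter are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision <= 0.0 and recall <= 0.0:
        return 0.0  # avoid division by zero when nothing was detected correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract; treating them as inputs to F1 is an assumption.
f1 = f_measure(precision=0.8701, recall=0.8577)
print(f"F1 = {f1:.4f}")
```

Because F1 is a harmonic mean, it always lies between the two inputs and sits closer to the smaller one.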

  • Research Article
  • Citations: 11
  • 10.1007/s11042-018-6959-4
Correlation based feature fusion for the temporal video scene segmentation task
  • Dec 7, 2018
  • Multimedia Tools and Applications
  • Rodrigo Mitsuo Kishi + 2 more

The available automatic temporal video scene segmentation methods still lack the efficacy to be employed in most practical multimedia systems. The ones showing better results are multimodal and based on late fusion. On the other hand, early fusion has not been sufficiently investigated in this task because of the well-known barriers of this approach: correlation identification, temporal synchronization and unique representation. This work presents a feature fusion method which deals with the mentioned difficulties and produces features which can enhance the efficacy of existing temporal video scene segmentation methods. This feature fusion process is performed on single-modal Bag of Features feature vectors and is intended to enrich previously captured latent semantics by performing temporal clustering of features, providing a unified representation of multiple temporally related features. This feature fusion process has been coupled with two off-the-shelf scene segmentation algorithms, presenting competitive results when compared with two other state-of-the-art multimodal temporal scene segmentation methods. The results indicate that the proposed early fusion feature representation method is a promising alternative in helping to boost video retrieval related tasks.

  • Research Article
  • 10.4028/www.scientific.net/amm.513-517.514
Multi-Modality Video Scene Segmentation Algorithm with Shot Force Competition
  • Feb 6, 2014
  • Applied Mechanics and Materials
  • Yun Zhu Xiang

To segment video scenes quickly and effectively, this paper proposes a multi-modality video scene segmentation algorithm with shot force competition. The method takes full account of the temporally associated co-occurrence of multimodal media data: it calculates the similarity between video shots by merging low-level video features, then segments the video into scenes using a judgment method based on shot competition. The authors' experiments show that the proposed method separates video scenes efficiently.

  • Conference Article
  • Citations: 2
  • 10.1145/2526188.2526202
Multimodal late fusion bag of features applied to scene detection
  • Nov 5, 2013
  • Bruno Lorenço Lopes + 1 more

Recent advances in technology have increased the availability of video data, creating a strong requirement for efficient systems to manage those materials. To make efficient use of video information, the data first has to be automatically segmented into smaller, manageable and understandable units, such as scenes. This paper presents a new multimodal video scene segmentation technique. The proposed approach combines Bag of Features based techniques (visual and aural) in order to exploit the latent semantics obtained by them in a complementary way, improving scene segmentation. The results achieved are promising.

  • Book Chapter
  • Citations: 1
  • 10.1201/9781003387374-35
Lightweight deep learning model for multimodal material segmentation in road environment scenes
  • Nov 15, 2023
  • Jinhuan Shan + 2 more

Fusion of multimodal data can effectively improve the perception ability of road infrastructure ontology. In this paper, a lightweight deep learning neural network is proposed to study the fusion segmentation effect of multimodal images under visible light, infrared light, and polarized light. The results showed that different modalities have different effects on the segmentation of different road materials. Especially for the recognition of road water, the segmentation effect was improved by 35.6% after fusing AoLP (angle of linear polarization) images. By using multimodal fusion segmentation, the mIoU (mean intersection over union) index was improved by 4.2% compared to ordinary RGB images.
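
The mIoU figure quoted above is a standard segmentation metric. A minimal sketch of how it is typically computed from a per-class confusion matrix is given below; the function name `mean_iou` and the toy two-class matrix are illustrative and are not taken from the chapter.

```python
def mean_iou(confusion):
    """Mean intersection over union.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other pixels labelled as c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical two-class example (e.g. "road" vs "water"), not data from the chapter.
toy = [[50, 10],
       [5, 35]]
print(f"mIoU = {mean_iou(toy):.4f}")
```

Per-class IoU is averaged without weighting by class size, so a rare class such as standing water influences the score as much as a common one.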

  • Research Article
  • 10.1007/s00530-025-01941-z
Multi-modal semi-supervised semantic segmentation for indoor scenes via adaptive CutMix and contrastive learning
  • Aug 1, 2025
  • Multimedia Systems
  • Xueqiang Lyu + 5 more

  • Research Article
  • Citations: 162
  • 10.1109/tcsvt.2011.2138830
Temporal Video Segmentation to Scenes Using High-Level Audiovisual Features
  • Aug 1, 2011
  • IEEE Transactions on Circuits and Systems for Video Technology
  • P Sidiropoulos + 5 more

In this paper, a novel approach to video temporal decomposition into semantic units, termed scenes, is presented. In contrast to previous temporal segmentation approaches that employ mostly low-level visual or audiovisual features, we introduce a technique that jointly exploits low-level and high-level features automatically extracted from the visual and the auditory channel. This technique is built upon the well-known method of the scene transition graph (STG), first by introducing a new STG approximation that features reduced computational cost, and then by extending the unimodal STG-based temporal segmentation technique to a method for multimodal scene segmentation. The latter exploits, among others, the results of a large number of TRECVID-type trained visual concept detectors and audio event detectors, and is based on a probabilistic merging process that combines multiple individual STGs while at the same time diminishing the need for selecting and fine-tuning several STG construction parameters. The proposed approach is evaluated on three test datasets, comprising TRECVID documentary films, movies, and news-related videos, respectively. The experimental results demonstrate the improved performance of the proposed approach in comparison to other unimodal and multimodal techniques of the relevant literature and highlight the contribution of high-level audiovisual features toward improved video segmentation to scenes.
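
The abstract above describes a probabilistic merging process over multiple individual STGs without giving its details. One simple way to picture such a merge, offered purely as an assumption and not as the authors' formulation, is to treat the fraction of STGs that flag a given shot boundary as that boundary's score and keep the boundaries whose score reaches a threshold. The names `merge_boundaries` and `threshold` are hypothetical.

```python
from collections import Counter


def merge_boundaries(stg_results, threshold=0.5):
    """Combine scene-boundary sets produced by several STGs.

    stg_results: list of sets of shot indices, one set per STG, where each
    index marks a shot after which that STG places a scene boundary.
    Returns the sorted indices flagged by at least `threshold` of the STGs.
    """
    if not stg_results:
        return []
    votes = Counter(b for result in stg_results for b in set(result))
    n = len(stg_results)
    return sorted(b for b, v in votes.items() if v / n >= threshold)


# Three hypothetical STGs (e.g. built from different features or parameters).
results = [{3, 7}, {3, 8}, {3, 7, 12}]
print(merge_boundaries(results))        # boundaries supported by at least half
print(merge_boundaries(results, 1.0))   # boundaries all STGs agree on
```

A single voting threshold of this kind replaces per-graph parameter tuning, which is in the spirit of the abstract's remark about reducing the need to fine-tune STG construction parameters.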

  • Conference Article
  • Citations: 3
  • 10.1117/12.2587991
Evaluation of multimodal semantic segmentation using RGB-D data
  • Apr 12, 2021
  • Jiesi Hu + 3 more

Our goal is to develop stable, accurate, and robust semantic scene understanding methods for wide-area scene perception and understanding, especially in challenging outdoor environments. To achieve this, we are exploring and evaluating a range of related technologies and solutions, including AI-driven multimodal scene perception, fusion, processing, and understanding. This work reports our efforts on the evaluation of a state-of-the-art approach for semantic segmentation with multiple RGB and depth sensing data. We employ four large datasets composed of diverse urban and terrain scenes and design various experimental methods and metrics. In addition, we develop new multi-dataset learning strategies to improve the detection and recognition of unseen objects. Extensive experiments, implementations, and results are reported in the paper.

  • Research Article
  • 10.5302/j.icros.2023.22.0234
3D Semantic Scene Segmentation Using Multi-View RGB-D Images in Indoor Environments (실내 환경에서 멀티-뷰 RGB-D 영상들을 활용하는 3차원 의미적 장면 분할)
  • Mar 31, 2023
  • Journal of Institute of Control, Robotics and Systems
  • Hye-Lim Bae + 1 more

This paper proposes a novel model for 3D semantic scene segmentation in indoor environments. Existing models for 3D semantic scene segmentation use either only 3D geometric features of the scene point cloud or only 2D visual features of RGB color images. We overcome the limitations of existing models and improve the performance of 3D semantic scene segmentation by proposing a multimodal 3D semantic scene segmentation model that uses both 3D geometric features of the scene point cloud and rich 2D visual features of multi-view color images. The proposed model overcomes the point sparsity problem by using the dense point cloud obtained from multi-view depth images and uses an adaptive point feature extractor to extract 3D geometric features representing the local structural characteristics of points. Moreover, the model adopts a unique early fusion strategy to fuse the 2D-3D features. Based on experiments conducted using the ScanNet benchmark dataset, we demonstrate the effectiveness and superiority of the proposed model.

  • Conference Article
  • Citations: 8
  • 10.1109/itaic.2019.8785474
Virtual Reality Scene Construction Based on Multimodal Video Scene Segmentation Algorithm
  • May 1, 2019
  • Rui Wang + 4 more

Video scene segmentation has become a research hotspot in the video field because of its important role in improving retrieval accuracy, and it also plays an important role in the construction of virtual scenes. To achieve fast and accurate video scene segmentation, this paper proposes a multi-modal video scene segmentation algorithm based on the ant colony algorithm. Following the idea of multi-modal feature fusion, the algorithm extracts the physical features of the different modalities in key frames. The similarity within each modality and the correlation between modalities are combined to calculate the similarity between shots. A shot similarity matrix is then constructed, and the ant colony algorithm is used to segment the video into scenes. Experimental data show that the algorithm segments video scenes well.

  • Research Article
  • Citations: 3
  • 10.1155/2022/1264847
Interactive Design of Business English Learning Resources Based on EDIPT Multimodal Model
  • Sep 8, 2022
  • Computational Intelligence and Neuroscience
  • Xiaomei Yang + 1 more

Aiming at the problem that online video learning resources of business English are scattered and the learners are inefficient in acquiring learning resources, this paper designed a business English learning system based on the EDIPT model. In addition, aiming at the problem of multifeature fusion between low-level features and high-level semantic features in video scenes, this paper proposes a multi-modal video scene segmentation algorithm based on a deep network. By minimizing the square sum of distances in the time period, the shots are clustered, and finally, the semantic scene is obtained. The experimental results show that the algorithm has good performance in classification accuracy and can effectively segment video scenes, which is helpful for users to improve their comprehensive business English skills.

  • Dissertation
  • 10.11606/t.55.2019.tde-28082019-110926
A deep-learning-based method for segmenting video into scenes (Um método de segmentação de vídeo em cenas baseado em aprendizagem profunda)
  • Jan 1, 2019
  • Tiago Henrique Trojahn

Automatic video scene segmentation is a current and relevant problem given its application in various services related to multimedia. Among the different techniques reported in the literature, the multimodal ones are considered more promising, given the ability to extract information from different media in a potentially complementary way, allowing for more significant segmentations. By processing information of different natures, such techniques face difficulties in modeling and obtaining a combined representation of information, as well as cost problems when processing each source of information individually. Finding a suitable combination of information that increases the effectiveness of segmentation at a relatively low computational cost becomes a challenge. At the same time, approaches based on Deep Learning have proven effective on a wide range of tasks, including classification of images and video. Techniques based on Deep Learning, such as Convolutional Neural Networks (CNNs), have achieved impressive results in related tasks by being able to extract significant patterns from data, including multimodal data. However, CNNs cannot properly learn the relationships between data temporally distributed among the shots of the same scene. This can leave the network unable to properly segment scenes whose characteristics change among shots. On the other hand, Recurrent Neural Networks (RNNs) have been successfully employed in textual processing since they are designed to analyze variable-length data sequences and can be developed to better explore the temporal relationships between low-level characteristics of related shots, potentially increasing the effectiveness of scene segmentation. There is a lack of multimodal segmentation methods exploring Deep Learning. Thus, this thesis proposes an automatic method for video scene segmentation that models the problem of segmentation as a classification problem. The method relies on a model developed to combine the pattern-extraction potential of CNNs with the sequence-processing potential of RNNs. The proposed model, unlike related work, eliminates the difficulty of modeling multimodal representations of the different input information, and also allows different approaches to multimodal (early or late) fusion to be instantiated. This method was evaluated in the scene segmentation task using a public video database, comparing the results obtained with the results of state-of-the-art techniques using different approaches. The results show a significant advance in the efficiency obtained.

  • Research Article
  • Citations: 12
  • 10.1109/tits.2024.3454597
Real-Time Multi-Scene Visibility Enhancement for Promoting Navigational Safety of Vessels Under Complex Weather Conditions
  • Dec 1, 2024
  • IEEE Transactions on Intelligent Transportation Systems
  • Ryan Wen Liu + 6 more

The visible-light camera, which is capable of environment perception and navigation assistance, has emerged as an essential imaging sensor for marine surface vessels in intelligent waterborne transportation systems (IWTS). However, the visual imaging quality inevitably suffers from several kinds of degradations (e.g., limited visibility, low contrast, color distortion, etc.) under complex weather conditions (e.g., haze, rain, and low-lightness). The degraded visual information will accordingly result in inaccurate environment perception and delayed operations for navigational risk. To promote the navigational safety of vessels, many computational methods have been presented to perform visual quality enhancement under poor weather conditions. However, most of these methods are essentially specific-purpose implementation strategies, only available for one specific weather type. To overcome this limitation, we propose to develop a general-purpose multi-scene visibility enhancement method, i.e., edge reparameterization- and attention-guided neural network (ERANet), to adaptively restore the degraded images captured under different weather conditions. In particular, our ERANet simultaneously exploits the channel attention, spatial attention, and reparameterization technology to enhance the visual quality while maintaining low computational cost. Extensive experiments conducted on standard and IWTS-related datasets have demonstrated that our ERANet could outperform several representative visibility enhancement methods in terms of both imaging quality and computational efficiency. The superior performance of IWTS-related object detection and scene segmentation could also be steadily obtained after ERANet-based visibility enhancement under complex weather conditions.

  • Conference Article
  • Citations: 2
  • 10.1109/icme.2012.167
Scene Segmentation and Pedestrian Classification from 3-D Range and Intensity Images
  • Jul 1, 2012
  • Xue Wei + 2 more

This paper proposes a new approach to classify obstacles using a time-of-flight camera, for applications in assistive navigation of the visually impaired. Combining range and intensity images enables fast and accurate object segmentation, and provides useful navigation cues such as distances to the nearby obstacles and obstacle types. In the proposed approach, a 3-D range image is first segmented using histogram thresholding and mean-shift grouping. Then Fourier and GIST descriptors are applied on each segmented object to extract shape and texture features. Finally, support vector machines are used to recognize the obstacles. This paper focuses on classifying pedestrian and non-pedestrian obstacles. Evaluated on an image data set acquired using a time-of-flight camera, the proposed approach achieves a classification rate of 99.5%.
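
The first step of the pipeline above, histogram thresholding of the range image, can be illustrated with a small sketch. The abstract does not say which thresholding rule is used, so the code below substitutes Otsu's method (choosing the cut that maximizes between-class variance) as a stand-in, applied to a flat list of range values. The mean-shift grouping step is not shown. The function name `otsu_threshold`, the bin count, and the sample depths are all assumptions.

```python
def otsu_threshold(values, bins=32):
    """Pick a threshold on 1-D range values by maximizing between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # all values identical, nothing to split
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1

    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(h * c for h, c in zip(hist, centers))

    best_var, best_t = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += hist[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance (unnormalized)
        if var > best_var:
            best_var, best_t = var, lo + (i + 1) * width
    return best_t


# Hypothetical range readings in metres: a near obstacle and a far background.
depths = [1.0, 1.1, 0.9, 1.2, 4.0, 4.2, 3.9, 4.1]
t = otsu_threshold(depths)
near = [d for d in depths if d < t]
print(f"threshold = {t:.3f} m, near pixels = {len(near)}")
```

In a full pipeline the pixels on each side of the threshold would then be grouped spatially before shape and texture descriptors are extracted.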
