NaviFormer: Multimodal scene segmentation for assistive navigation
- Conference Article
21
- 10.1145/1631272.1631383
- Oct 19, 2009
In this work the problem of automatic decomposition of video into elementary semantic units, known in the literature as scenes, is addressed. Two multi-modal automatic scene segmentation techniques are proposed, both building upon the Scene Transition Graph (STG). In the first of the proposed approaches, speaker diarization results are used for introducing a post-processing step to the STG construction algorithm, with the objective of discarding scene boundaries erroneously identified according to visual-only dissimilarity. In the second approach, speaker diarization and additional audio analysis results are employed and a separate audio-based STG is constructed, in parallel to the original STG based on visual information. The two STGs are subsequently combined. Preliminary results from the application of the proposed techniques to broadcast videos reveal their improved performance over previous approaches.
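The post-processing idea in the first approach can be illustrated with a minimal sketch. The abstract does not give the actual rule, so the rule below is an assumption made for illustration only: a visually detected boundary is discarded when a single speaker turn spans it. The names `filter_boundaries`, `SpeakerTurn`, and `margin` are hypothetical.

```python
from typing import List, Tuple

# (start_sec, end_sec, speaker_id); a hypothetical diarization output format
SpeakerTurn = Tuple[float, float, str]


def filter_boundaries(
    boundaries: List[float],
    turns: List[SpeakerTurn],
    margin: float = 0.5,
) -> List[float]:
    """Drop visual scene boundaries that fall inside a single speaker turn.

    Assumed rule (not taken from the paper): if one speaker keeps talking
    from at least `margin` seconds before the boundary to at least `margin`
    seconds after it, the boundary is treated as a false positive.
    """
    kept = []
    for b in boundaries:
        spanned = any(start + margin <= b <= end - margin for start, end, _ in turns)
        if not spanned:
            kept.append(b)
    return kept


# Toy data: three candidate boundaries from a visual-only segmentation.
candidate_boundaries = [15.0, 25.0, 47.5]
speaker_turns = [
    (0.0, 11.0, "A"),
    (11.5, 20.0, "B"),   # speaker B talks across the boundary at 15.0 s
    (28.0, 35.0, "A"),
    (48.0, 60.0, "C"),
]
filtered = filter_boundaries(candidate_boundaries, speaker_turns)
print(filtered)
```

In this toy input only the boundary at 15.0 s is removed, because speaker B's turn covers it; the other two fall in gaps between turns and are kept.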
- Conference Article
1
- 10.1109/ichci54629.2021.00019
- Nov 1, 2021
Video scene segmentation is an important part of realizing content-based video retrieval (CBVR). To address the low efficiency of video scene segmentation in CBVR, this paper proposes a multi-modal video scene segmentation optimization algorithm based on convolutional neural network (CNN) feature extraction. Because the multi-modal data of a video contains a large amount of information, the VGG19 network is modified in a targeted manner, and the low-level and semantic features of the various modalities are extracted from each video shot. By forming these features into vectors and applying methods such as triplet loss learning and shot similarity calculation, the scene segmentation task is converted into a binary classification problem on shot boundaries. A scoring mechanism is then established to optimize the results and complete the scene segmentation task. Experimental results show that the algorithm is effective for video scene segmentation, with overall recall and precision reaching 85.77% and 87.01%, respectively. Compared with the shot similarity graph method, the two indicators increase by 10% and 9%, respectively. Compared with the DeepSSS method, which also uses a deep learning network model, the comprehensive F-measure increases by 8%.
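For readers who want to relate the reported recall and precision to a single score, the sketch below computes the standard F-measure. It assumes the balanced F1 (beta = 1) definition, which the abstract does not specify; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta = 1 gives the harmonic mean (F1)."""
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
f1 = f_measure(precision=0.8701, recall=0.8577)
print(f"F1 = {f1:.4f}")
```

Under that assumption the reported figures correspond to an F1 of roughly 0.864. The abstract does not state the F-measure value itself, so this is a derived figure, not a quoted one.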
- Research Article
11
- 10.1007/s11042-018-6959-4
- Dec 7, 2018
- Multimedia Tools and Applications
The available automatic temporal video scene segmentation methods still lack the efficacy needed for most practical multimedia systems. Those showing better results are multimodal and based on late fusion. Early fusion, on the other hand, has not been sufficiently investigated for this task because of its well-known barriers: correlation identification, temporal synchronization, and unique representation. This work presents a feature fusion method that deals with these difficulties and produces features that can enhance the efficacy of existing temporal video scene segmentation methods. The fusion process is performed on single-modal Bag of Features vectors and is intended to enrich previously captured latent semantics by performing temporal clustering of features, providing a unified representation of multiple temporally related features. The fusion process has been coupled with two off-the-shelf scene segmentation algorithms, presenting competitive results when compared with two other state-of-the-art multimodal temporal scene segmentation methods. The results indicate that the proposed early fusion feature representation method is a promising alternative for boosting video retrieval related tasks.
- Research Article
- 10.4028/www.scientific.net/amm.513-517.514
- Feb 6, 2014
- Applied Mechanics and Materials
To segment video scenes quickly and effectively, this paper proposes a multi-modal video scene segmentation algorithm based on shot force competition. The method takes full account of the temporally associated co-occurrence of multimodal media data: it calculates the similarity between video shots by merging low-level video features, and then segments the video into scenes using a shot-competition decision method. The authors' experiments show that the proposed method separates video scenes efficiently.
- Conference Article
2
- 10.1145/2526188.2526202
- Nov 5, 2013
Recent advances in technology have increased the availability of video data, creating a strong requirement for efficient systems to manage those materials. To make efficient use of video information, the data first has to be automatically segmented into smaller, manageable, and understandable units, such as scenes. This paper presents a new multimodal video scene segmentation technique. The proposed approach combines Bag of Features based techniques (visual and aural) in order to exploit the latent semantics they obtain in a complementary way, improving scene segmentation. The results achieved are promising.
- Book Chapter
1
- 10.1201/9781003387374-35
- Nov 15, 2023
Fusion of multimodal data can effectively improve the perception ability of road infrastructure ontology. In this paper, a lightweight deep learning neural network is proposed to study the fusion segmentation effect of multimodal images under visible light, infrared light, and polarized light. The results showed that different modalities have different effects on the segmentation of different road materials. Especially for the recognition of road water, the segmentation effect was improved by 35.6% after fusing AoLP (angle of linear polarization) images. By using multimodal fusion segmentation, the mIoU (mean intersection over union) index was improved by 4.2% compared to ordinary RGB images.
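The mIoU metric cited above can be sketched as follows. This is the common per-class intersection-over-union averaged over classes, written in plain Python on flattened label lists; it is not the authors' evaluation code, and the choice to skip classes absent from both prediction and ground truth is an assumption.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection over union across classes for flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps (an assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy 6-pixel example with three classes.
pred_labels = [0, 0, 1, 1, 1, 2]
true_labels = [0, 1, 1, 1, 1, 2]
miou = mean_iou(pred_labels, true_labels, num_classes=3)
print(miou)
```

Here the per-class IoU values are 0.5, 0.75, and 1.0, so the mean is 0.75.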
- Research Article
- 10.1007/s00530-025-01941-z
- Aug 1, 2025
- Multimedia Systems
Multi-modal semi-supervised semantic segmentation for indoor scenes via adaptive CutMix and contrastive learning
- Research Article
162
- 10.1109/tcsvt.2011.2138830
- Aug 1, 2011
- IEEE Transactions on Circuits and Systems for Video Technology
In this paper, a novel approach to video temporal decomposition into semantic units, termed scenes, is presented. In contrast to previous temporal segmentation approaches that employ mostly low-level visual or audiovisual features, we introduce a technique that jointly exploits low-level and high-level features automatically extracted from the visual and the auditory channel. This technique is built upon the well-known method of the scene transition graph (STG), first by introducing a new STG approximation that features reduced computational cost, and then by extending the unimodal STG-based temporal segmentation technique to a method for multimodal scene segmentation. The latter exploits, among others, the results of a large number of TRECVID-type trained visual concept detectors and audio event detectors, and is based on a probabilistic merging process that combines multiple individual STGs while at the same time diminishing the need for selecting and fine-tuning several STG construction parameters. The proposed approach is evaluated on three test datasets, comprising TRECVID documentary films, movies, and news-related videos, respectively. The experimental results demonstrate the improved performance of the proposed approach in comparison to other unimodal and multimodal techniques of the relevant literature and highlight the contribution of high-level audiovisual features toward improved video segmentation to scenes.
- Conference Article
3
- 10.1117/12.2587991
- Apr 12, 2021
Our goal is to develop stable, accurate, and robust semantic scene understanding methods for wide-area scene perception and understanding, especially in challenging outdoor environments. To achieve this, we are exploring and evaluating a range of related technology and solutions, including AI-driven multimodal scene perception, fusion, processing, and understanding. This work reports our efforts on the evaluation of a state-of-the-art approach for semantic segmentation with multiple RGB and depth sensing data. We employ four large datasets composed of diverse urban and terrain scenes and design various experimental methods and metrics. In addition, we also develop new strategies of multi-datasets learning to improve the detection and recognition of unseen objects. Extensive experiments, implementations, and results are reported in the paper.
- Research Article
- 10.5302/j.icros.2023.22.0234
- Mar 31, 2023
- Journal of Institute of Control, Robotics and Systems
This paper proposes a novel model for 3D semantic scene segmentation in indoor environments. Existing models for 3D semantic scene segmentation use either only 3D geometric features of the scene point cloud or only 2D visual features of RGB color images. We overcome the limitations of existing models and improve the performance of 3D semantic scene segmentation by proposing a multimodal 3D semantic scene segmentation model to use both 3D geometric features of the scene point cloud and rich 2D visual features of multi-view color images. The proposed model overcomes the point sparsity problem by using the dense point cloud obtained from multi-view depth images and uses an adaptive point feature extractor to extract 3D geometric features representing the local structural characteristics of points. Moreover, the model adopts a unique early fusion strategy to fuse the 2D-3D features. Based on experiments conducted using the ScanNet benchmark dataset, we demonstrate the effectiveness and superiority of the proposed model.
- Conference Article
8
- 10.1109/itaic.2019.8785474
- May 1, 2019
Video scene segmentation has become a research hotspot in the video field because of its important role in improving retrieval accuracy and in the construction of virtual scenes. To realize fast and accurate video scene segmentation, this paper proposes a multi-modal video scene segmentation algorithm based on the ant colony algorithm. Following the idea of multi-modal feature fusion, the algorithm extracts the physical features of the different modalities in key frames. The similarity within each modality and the correlation between different modalities are combined to calculate the similarity between shots. A shot similarity matrix is then constructed, and the ant colony algorithm is used to segment the video into scenes. The experimental data show that the algorithm segments video scenes well.
- Research Article
3
- 10.1155/2022/1264847
- Sep 8, 2022
- Computational Intelligence and Neuroscience
Aiming at the problem that online video learning resources of business English are scattered and the learners are inefficient in acquiring learning resources, this paper designed a business English learning system based on the EDIPT model. In addition, aiming at the problem of multifeature fusion between low-level features and high-level semantic features in video scenes, this paper proposes a multi-modal video scene segmentation algorithm based on a deep network. By minimizing the square sum of distances in the time period, the shots are clustered, and finally, the semantic scene is obtained. The experimental results show that the algorithm has good performance in classification accuracy and can effectively segment video scenes, which is helpful for users to improve their comprehensive business English skills.
- Dissertation
- 10.11606/t.55.2019.tde-28082019-110926
- Jan 1, 2019
Automatic video scene segmentation is a current and relevant problem given its application in various services related to multimedia. Among the different techniques reported in the literature, the multimodal ones are considered more promising, given the ability to extract information from different media in a potentially complementary way, allowing for more significant segmentations. By processing information of different natures, such techniques face difficulties in modeling and obtaining a combined representation of information, as well as cost problems when processing each source of information individually. Finding a suitable combination of information that increases the effectiveness of segmentation at a relatively low computational cost becomes a challenge. At the same time, approaches based on Deep Learning have proven effective on a wide range of tasks, including classification of images and video. Techniques based on Deep Learning, such as Convolutional Neural Networks (CNNs), have achieved impressive results in related tasks by being able to extract significant patterns from data, including multimodal data. However, CNNs cannot properly learn the relationships between data temporally distributed among the shots of the same scene. This can leave the network unable to properly segment scenes whose characteristics change among shots. On the other hand, Recurrent Neural Networks (RNNs) have been successfully employed in textual processing since they are designed to analyze variable-length data sequences and can be developed to better explore the temporal relationships between low-level characteristics of related shots, potentially increasing the effectiveness of scene segmentation. There is a lack of multimodal segmentation methods exploring Deep Learning. Thus, this thesis proposes an automatic method for video scene segmentation that models the problem of segmentation as a classification problem.
The method relies on a model developed to combine the pattern-extraction potential of CNNs with the sequence-processing potential of RNNs. Unlike related works, the proposed model eliminates the difficulty of modeling multimodal representations of the different input information, and also allows different approaches to multimodal (early or late) fusion to be instantiated. The method was evaluated on the scene segmentation task using a public video database, comparing the results obtained with those of state-of-the-art techniques using different approaches. The results show a significant advance in the efficiency obtained.
- Research Article
12
- 10.1109/tits.2024.3454597
- Dec 1, 2024
- IEEE Transactions on Intelligent Transportation Systems
The visible-light camera, which is capable of environment perception and navigation assistance, has emerged as an essential imaging sensor for marine surface vessels in intelligent waterborne transportation systems (IWTS). However, the visual imaging quality inevitably suffers from several kinds of degradations (e.g., limited visibility, low contrast, color distortion, etc.) under complex weather conditions (e.g., haze, rain, and low-lightness). The degraded visual information will accordingly result in inaccurate environment perception and delayed operations for navigational risk. To promote the navigational safety of vessels, many computational methods have been presented to perform visual quality enhancement under poor weather conditions. However, most of these methods are essentially specific-purpose implementation strategies, only available for one specific weather type. To overcome this limitation, we propose to develop a general-purpose multi-scene visibility enhancement method, i.e., edge reparameterization- and attention-guided neural network (ERANet), to adaptively restore the degraded images captured under different weather conditions. In particular, our ERANet simultaneously exploits the channel attention, spatial attention, and reparameterization technology to enhance the visual quality while maintaining low computational cost. Extensive experiments conducted on standard and IWTS-related datasets have demonstrated that our ERANet could outperform several representative visibility enhancement methods in terms of both imaging quality and computational efficiency. The superior performance of IWTS-related object detection and scene segmentation could also be steadily obtained after ERANet-based visibility enhancement under complex weather conditions.
- Conference Article
2
- 10.1109/icme.2012.167
- Jul 1, 2012
This paper proposes a new approach to classify obstacles using a time-of-flight camera, for applications in assistive navigation of the visually impaired. Combining range and intensity images enables fast and accurate object segmentation, and provides useful navigation cues such as distances to the nearby obstacles and obstacle types. In the proposed approach, a 3-D range image is first segmented using histogram thresholding and mean-shift grouping. Then Fourier and GIST descriptors are applied on each segmented object to extract shape and texture features. Finally, support vector machines are used to recognize the obstacles. This paper focuses on classifying pedestrian and non-pedestrian obstacles. Evaluated on an image data set acquired using a time-of-flight camera, the proposed approach achieves a classification rate of 99.5%.