Motion-aware multi-level feature fusion for few-shot action recognition

Similar Papers
  • Conference Article
  • Citations: 3
  • 10.1109/iccc54389.2021.9674659
Remote Sensing Image Scene Classification Based on Multi-level Feature Fusion
  • Dec 10, 2021
  • Ya Chen + 4 more

Remote sensing image scene classification is an important task in remote sensing image analysis and interpretation. Convolutional neural networks (CNNs) have become representative image classification networks owing to their capacity for feature learning and representation. Although the semantic information of feature maps is enhanced as the number of convolution layers increases, spatial geometric detail may be lost, which greatly affects classification accuracy. To solve this problem, we propose a multi-level convolutional feature fusion network (MLCF²N) for remote sensing image scene classification. Specifically, the original image is fed to the multi-level feature extraction network to extract features of different levels by increasing the depth of the convolution layer. Then, to highlight the key areas of interest at all levels, we design a key information weight network to obtain the feature map of the discriminative area in each level's features. Finally, we fuse the low-level, middle-level and high-level features to obtain more discriminative features for scene classification. Experiments are conducted on several public remote sensing image scene classification datasets (AID, NWPU-RESISC45 and OPTIMAL 31) to evaluate the performance of the method. The experimental results show that multi-level feature fusion classification outperforms single-level feature classification and some state-of-the-art methods.

  • Conference Article
  • Citations: 1
  • 10.1109/indin51773.2022.9976099
Multi-level Feature Reweighting and Fusion for Instance Segmentation
  • Jul 25, 2022
  • Xuan-Thuy Vo + 3 more

Accurate instance segmentation requires high-resolution features for dense pixel-wise prediction. However, using high-resolution feature maps results in high model complexity and ineffective receptive fields. To overcome these problems, conventional methods explore multi-level feature fusion that exchanges information between low-level features at earlier layers and high-level features at top layers. Both low-level and high-level information is extracted by the hierarchical backbone network, where high-level features contain more semantic cues and low-level features encompass more specific patterns. Thus, adopting these features when training the segmentation model is necessary, and designing a more efficient multi-level feature fusion is crucial. Existing methods balance such information by using top-down and bottom-up pathway connections with more inefficient convolution layers to produce richer multi-scale features. In this work, our contribution is twofold: (1) a simple but effective multi-level feature reweighting layer is proposed to strengthen deep high-level features based on channel reweighting generated from multiple features of the backbone, and (2) an efficient fusion block is proposed to process low-resolution features in a depth-to-spatial manner and combine the enhanced multi-level features. These designs enable the segmentation models to predict instance kernels for mask generation on high-level feature maps. To verify the effectiveness of the proposed method, we conduct experiments on the challenging benchmark dataset MS-COCO. Surprisingly, our simple network outperforms the baseline in both accuracy and inference speed. More specifically, we achieve 35.4% mask AP at 19.5 FPS on a GPU device, becoming a state-of-the-art instance segmentation method.
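
The abstract above mentions processing low-resolution features "in a depth-to-spatial manner" without giving details. As a rough illustration only, the sketch below shows the generic depth-to-space (pixel shuffle) rearrangement on nested Python lists. The function name `depth_to_space`, the upscale factor `r`, and the channel ordering (the one used by PyTorch's `pixel_shuffle`) are assumptions made for this example, not details taken from the paper.

```python
def depth_to_space(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed ordering (same as PyTorch pixel_shuffle): input channel
    c*r*r + i*r + j is written to output position (c, y*r + i, z*r + j).
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    # Allocate the higher-resolution output with fewer channels.
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become one 2x2 channel when r = 2.
low_res = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
high_res = depth_to_space(low_res, 2)
```

The point of the rearrangement is that spatial resolution is gained by moving values out of the channel dimension, so no extra convolution is needed for the upsampling itself.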

  • Research Article
  • 10.32604/csse.2023.040132
MSF-Net: A Multilevel Spatiotemporal Feature Fusion Network Combines Attention for Action Recognition
  • Jan 1, 2023
  • Computer Systems Science and Engineering
  • Mengmeng Yan + 5 more

An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed to address three issues in 3D convolutional neural networks: single-scale spatiotemporal feature extraction, information redundancy, and insufficient extraction of frequency-domain information in channels. First, based on a 3D CNN, this paper designs a new multilevel spatiotemporal feature fusion (MSF) structure, which is embedded in the network model. Through multilevel spatiotemporal feature separation, splicing and fusion, it fuses spatial receptive fields and short-, medium- and long-term time series information at different scales while reducing network parameters. Second, a multi-frequency channel and spatiotemporal attention module (FSAM) is introduced to assign corresponding weights to the different frequency features and spatiotemporal features in the channels, reducing the information redundancy of the feature maps. Finally, we embed the proposed method into the R3D model, which replaces the 2D convolutional filters in the 2D ResNet with 3D convolutional filters, and conduct extensive experimental validation on the small and medium-sized dataset UCF101 and the large dataset Kinetics-400. The results show that our model increases recognition accuracy on both datasets. On UCF101 in particular, our model outperforms R3D with a maximum recognition accuracy improvement of 7.2% while using 34.2% fewer parameters. The MSF and FSAM are also migrated to another traditional 3D action recognition model, C3D, for application testing. The test results on UCF101 show that recognition accuracy improves by 8.9%, demonstrating the strong generalization ability and universality of the method.

  • Research Article
  • Citations: 13
  • 10.1016/j.saa.2023.123147
For cervical cancer diagnosis: Tissue Raman spectroscopy and multi-level feature fusion with SENet attention mechanism
  • Jul 13, 2023
  • Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy
  • Yang Liu + 4 more

  • Research Article
  • 10.1088/2057-1976/ae2623
MLGF-GAN: a multi-level local-global feature fusion GAN for OCT image super-resolution
  • Dec 10, 2025
  • Biomedical Physics & Engineering Express
  • Tingting Han + 9 more

Optical coherence tomography (OCT), a non-invasive imaging modality, holds significant clinical value in cardiology and ophthalmology. However, its imaging quality is often constrained by inherently limited resolution, thereby affecting diagnostic utility. For OCT-based diagnosis, enhancing perceptual quality that emphasizes human visual recognition ability and diagnostic effectiveness is crucial. Existing super-resolution methods prioritize reconstruction accuracy (e.g., PSNR optimization) but neglect perceptual quality. To address this, we propose a Multi-level Local-Global feature Fusion Generative Adversarial Network (MLGF-GAN) that systematically integrates local details, global contextual information, and multilevel features to fully exploit the recoverable information in the image. The Local Feature Extractor (LFE) employs Coordinate Attention-enhanced convolutional neural network (CNN) for lesion-focused local feature refinement, and the Global Feature Extractor (GFE) employs shifted-window Transformers to model long-range dependencies. The Multi-level Feature Fusion Structure (MFFS) hierarchically aggregates image features and adaptively processes information at different scales. The multi-scale (×2, ×4, ×8) evaluations conducted on coronary and retinal OCT datasets demonstrate that the proposed model achieves highly competitive perceptual quality across all scales while maintaining reconstruction accuracy. The generated OCT super-resolution images exhibit superior texture detail restoration and spectral consistency, contributing to improved accuracy and reliability in clinical assessment. Furthermore, cross-pathology experiments further demonstrate that the proposed model possesses excellent generalization capability.

  • Research Article
  • 10.1177/10775463251399722
Fault diagnosis method for train bogie axlebox based on deep learning feature fusion of CWT-MVGG
  • Nov 18, 2025
  • Journal of Vibration and Control
  • Huikang Lv + 5 more

This paper addresses critical challenges in rail transit bogie fault diagnosis: strong noise interference, inadequate fault feature representation, and limited generalization in small-sample conditions. A deep learning-based fault diagnosis method is proposed, integrating variational modal decomposition (VMD), continuous wavelet transform (CWT), and a multi-level multi-scale feature fusion visual geometry group network (MVGG). The method utilizes a three-phase synergistic optimization framework. First, an adaptive VMD algorithm decomposes the original vibration signal into intrinsic mode functions. Second, CWT converts the one-dimensional time-series signal into a two-dimensional time-frequency image. Finally, the MVGG network is designed to incorporate multi-scale feature extraction, hybrid attention mechanisms, and multi-level feature fusion. Experimental results show that the proposed method achieves 98.98% fault diagnosis accuracy with 70% training samples. Compared with representative methods, it significantly enhances intra-class compactness and inter-class separability in the feature space. Notably, under 20% small-sample conditions, the method maintains 95.96% accuracy, demonstrating its parameter efficiency and strong generalization capability. Comprehensive analyses show that the proposed method exhibits superior noise robustness, enhanced feature discriminability, and strong small-sample adaptability in bogie axlebox bearing fault diagnosis. These findings offer a reliable solution for intelligent operation and maintenance of rail transit systems in complex operational environments.

  • Conference Article
  • Citations: 3
  • 10.1117/12.2640791
Image sentiment analysis method based on multi-level feature fusion
  • Oct 13, 2022
  • Tao S Sun + 2 more

With the development of social platforms, people are more likely to use pictures to share their sentiments and opinions. With the development of deep learning, current mainstream image sentiment analysis research focuses mainly on high-level semantic sentiment features of images and less on low-level sentiment features such as colors and lines. To address this problem, this paper proposes an image sentiment analysis model based on multi-level feature fusion (MLFF). First, a convolutional neural network is used to learn different levels of image features. Then, the multi-level features are input into a bidirectional sequential neural network, and attention is introduced to fuse them into multi-level fusion features. Finally, the multi-level fusion features are input to a fully connected layer for classification. The experimental results show that the proposed fusion model makes effective use of multi-level image features and improves the performance of image sentiment analysis.
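
The abstract above does not say which form of attention is used to fuse the multi-level features. Purely as an illustration of the general idea, the sketch below fuses per-level feature vectors with softmax weights computed from dot-product scores. The names `attention_fuse` and `query`, and the dot-product scoring itself, are assumptions for this example; the convolutional and bidirectional sequential stages described in the abstract are not modelled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(level_features, query):
    """Fuse equally sized feature vectors, one per level.

    Each level is scored by its dot product with `query` (a hypothetical
    learned vector); the fused feature is the softmax-weighted sum.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in level_features]
    weights = softmax(scores)
    dim = len(query)
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, level_features))
        for d in range(dim)
    ]
    return fused, weights


# Two levels (e.g. low-level and high-level) with 2-dimensional features.
levels = [[1.0, 0.0], [0.0, 1.0]]
fused_even, weights_even = attention_fuse(levels, [0.0, 0.0])
fused_low, weights_low = attention_fuse(levels, [10.0, 0.0])
```

With a zero query every level receives the same weight, so the result is a plain average; a query aligned with one level shifts most of the weight to that level.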

  • Research Article
  • Citations: 29
  • 10.1109/access.2019.2940051
Deep Learning Based Weighted Feature Fusion Approach for Sentiment Analysis
  • Jan 1, 2019
  • IEEE Access
  • Mohd Usama + 5 more

Deep learning algorithms have achieved remarkable results in natural language processing (NLP) and computer vision. Hence, there is a continuing trend of using these algorithms, such as convolutional and recurrent neural networks, for text analytics tasks to extract useful information. Feature extraction is one of the important reasons behind the success of these networks. Moreover, passing features from one layer to another within a network, and from one network to another, has been done. However, multilevel and multitype feature fusion remains unexplored in sentiment analysis. So, in this paper, we use three datasets to show the advantages of extracting and fusing multilevel as well as multitype features from different neural networks. Multilevel features are from different layers of the same network, and multitype features are from different network architectures. Experimental results demonstrate that the proposed model based on multilevel and multitype weighted feature fusion outperforms many existing works, with accuracies of 80.2%, 48.2%, and 87.0% on the MR, SST1, and SST2 datasets respectively.

  • Research Article
  • 10.1145/3767331
MAFF-Net: A Multi-level Acoustic Feature Fusion Network For Synthetic Audio Detection
  • Sep 30, 2025
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Dong Chen + 5 more

Voice spoofing attacks have become a significant challenge in today’s security domain. Although progress has been made in synthetic speech detection technology, existing detection methods still struggle to effectively identify unknown attack strategies. To address these challenges, we propose a novel multi-level acoustic feature fusion framework, MAFF-Net, which comprises three main components: multi-level acoustic feature extraction, cross-attention feature fusion and graph-aggregated detection module. The multi-level acoustic feature extraction module involves two complementary processes: multi-spectrogram feature extraction, which captures low-level physical characteristics of the audio signal, and Wav2vec2 feature extraction, which focuses on high-level speech representations. These multi-level features are subsequently integrated through cross-attention, enhancing the discriminative power of the model. To better evaluate the generalization capability of the proposed model, we introduce Chinese Advanced Synthetic Speech Dataset (CASSD), a new dataset that incorporates speech generated using 11 state-of-the-art synthesis techniques. Extensive experiments conducted across four different datasets demonstrate that our approach consistently outperforms existing single-model methods, highlighting the superior performance of MAFF-Net in synthetic speech detection.

  • Research Article
  • Citations: 2
  • 10.3389/fphy.2023.1136021
Multilevel feature cooperative alignment and fusion for unsupervised domain adaptation smoke detection
  • Feb 3, 2023
  • Frontiers in Physics
  • Fangrong Zhou + 7 more

Early smoke detection using digital image processing technology is an important research field, with major applications in reducing fire hazards and protecting the ecological environment. Because the color, shape and size of smoke change in complex ways over time, it is challenging to accurately recognize smoke in a given image. In addition, because of domain shift, a trained detector adapts poorly to smoke in real scenes, resulting in a sharp drop in detection performance. To solve this problem, this paper proposes an unsupervised domain adaptive smoke detection algorithm based on Multilevel feature Cooperative Alignment and Fusion (MCAF). First, cooperative domain alignment is performed on the features of different scales obtained by the feature extraction network to reduce the domain difference and enhance the generalization ability of the model. Second, multilevel feature fusion modules are embedded at different depths of the network to enhance the representation of small targets. The proposed method is evaluated on multiple datasets, and the results show the effectiveness of the method.

  • Conference Article
  • Citations: 1
  • 10.1109/vcip47243.2019.8965776
Saliency-based Deep Multi-level Semantic Feature Fusion for Person Re-identification
  • Dec 1, 2019
  • Yinhao Wang + 4 more

Person re-identification (Re-ID) is a challenging problem due to external environmental disturbances and significant intra-class appearance variations. Discriminative person features should cover global and partial representations, and accordingly can describe high-level and middle-level semantic information of persons. To achieve this goal, a saliency-based deep multi-level semantic feature representation and fusion algorithm is proposed. First, a feature fusion scheme at the middle layer of our deep network is presented to effectively fuse global CNN features. Second, a part-based feature extraction method is designed to extract high-level semantic features. In addition, a parameter-free multi-scale saliency-based enhancement algorithm is proposed to compute patch-level saliency scores to enhance the distinctiveness of part-based features. Experimental results on three public datasets, namely Market-1501, DukeMTMC-reID, and CUHK03, demonstrate the effectiveness of the proposed method compared with state-of-the-art Re-ID approaches.

  • Research Article
  • Citations: 45
  • 10.1016/j.jag.2022.102676
MapsNet: Multi-level feature constraint and fusion network for change detection
  • Mar 4, 2022
  • International Journal of Applied Earth Observation and Geoinformation
  • Jianping Pan + 9 more

  • Book Chapter
  • Citations: 13
  • 10.1007/978-3-030-59719-1_45
Multi-phase and Multi-level Selective Feature Fusion for Automated Pancreas Segmentation from CT Images
  • Jan 1, 2020
  • Xixi Jiang + 7 more

CT images scanned in arterial and venous phases have been demonstrated to provide complementary information for accurate pancreas segmentation. In this paper, we propose a novel multi-phase and multi-level selective feature fusion network (MMNet) with a core component named adaptive cross refinement (ACR) module. Specifically, MMNet adopts two parallel encoders to extract features of the two phases respectively, which are then fused by ACR to excel each complementarity advantage. Unlike most existing fusion methods which only exchange and combine features of a single level with the same resolution between two phases/modalities, ACR module intelligently aggregates features of all levels in one phase as a multi-level prior, and then adaptively selects the most effective information from the multi-level prior to refine features at each level of the other phase. Such multi-phase, multi-level selective feature exchange and fusion strategy is bi-directional to mutually benefit segmentation of both phases. Experimental results on 141 cases of our private dataset demonstrate the effectiveness of our ACR module and superior performance to the state-of-the-art fusion methods.

  • Conference Article
  • Citations: 335
  • 10.1109/iccv.2017.533
RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation
  • Oct 1, 2017
  • Seungyong Lee + 2 more

In multi-class indoor semantic segmentation using RGB-D data, it has been shown that incorporating depth feature into RGB feature is helpful to improve segmentation accuracy. However, previous studies have not fully exploited the potentials of multi-modal feature fusion, e.g., simply concatenating RGB and depth features or averaging RGB and depth score maps. To learn the optimal fusion of multimodal features, this paper presents a novel network that extends the core idea of residual learning to RGB-D semantic segmentation. Our network effectively captures multilevel RGB-D CNN features by including multi-modal feature fusion blocks and multi-level feature refinement blocks. Feature fusion blocks learn residual RGB and depth features and their combinations to fully exploit the complementary characteristics of RGB and depth data. Feature refinement blocks learn the combination of fused features from multiple levels to enable high-resolution prediction. Our network can efficiently train discriminative multi-level features from each modality end-to-end by taking full advantage of skip-connections. Our comprehensive experiments demonstrate that the proposed architecture achieves the state-of-the-art accuracy on two challenging RGB-D indoor datasets, NYUDv2 and SUN RGB-D.

  • Research Article
  • Citations: 16
  • 10.1186/s12859-023-05212-4
Multi-view feature representation and fusion for drug-drug interactions prediction
  • Mar 14, 2023
  • BMC Bioinformatics
  • Jing Wang + 5 more

Background: Drug-drug interactions (DDIs) prediction is vital for pharmacology and clinical application to avoid adverse drug reactions on patients. It is challenging because DDIs are related to multiple factors, such as genes, drug molecular structure, diseases, biological processes, side effects, etc. It is a crucial technology for Knowledge graph to present multi-relation among entities. Recently some existing graph-based computation models have been proposed for DDIs prediction and get good performance. However, there are still some challenges in the knowledge graph representation, which can extract rich latent features from drug knowledge graph (KG).

Results: In this work, we propose a novel multi-view feature representation and fusion (MuFRF) architecture to realize DDIs prediction. It consists of two views of feature representation and a multi-level latent feature fusion. For the feature representation from the graph view and KG view, we use graph isomorphism network to map drug molecular structures and use RotatE to implement the vector representation on bio-medical knowledge graph, respectively. We design concatenate-level and scalar-level strategies in the multi-level latent feature fusion to capture latent features from drug molecular structure information and semantic features from bio-medical KG. And the multi-head attention mechanism achieves the optimization of features on binary and multi-class classification tasks. We evaluate our proposed method based on two open datasets in the experiments. Experiments indicate that MuFRF outperforms the classic and state-of-the-art models.

Conclusions: Our proposed model can fully exploit and integrate the latent feature from the drug molecular structure graph (graph view) and rich bio-medical knowledge graph (KG view). We find that a multi-view feature representation and fusion model can accurately predict DDIs. It may contribute to providing with some guidance for research and validation for discovering novel DDIs.
