A Decoding Scheme With Successive Aggregation of Multi-Level Features For Light-Weight Semantic Segmentation
Multi-scale architectures, including hierarchical vision transformers, are commonly applied to high-resolution semantic segmentation to contain computational complexity with minimal performance loss. In this paper, we propose a novel decoding scheme for semantic segmentation that takes multi-level features from an encoder with such a multi-scale architecture. The decoding scheme, based on a multi-level vision transformer, aims to achieve not only reduced computational expense but also higher segmentation accuracy by introducing successive cross-attention in the aggregation of the multi-level features. Furthermore, a way to enhance the multi-level features with the aggregated semantics is proposed. The effort focuses on maintaining contextual consistency from the perspective of attention allocation and brings improved performance at significantly lower computational cost. A set of experiments on popular datasets demonstrates the superiority of the proposed scheme over state-of-the-art semantic segmentation models in terms of computational cost without loss of accuracy, and extensive ablation studies prove the effectiveness of the proposed ideas.
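As an illustration of the kind of successive cross-attention aggregation the abstract describes, the sketch below folds progressively finer feature levels into an aggregated semantic representation. The single-head attention, feature sizes, and residual update are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head cross-attention: `query` attends to `context`.
    query: (Nq, d), context: (Nc, d). Returns (Nq, d)."""
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

# Successively aggregate three feature levels (coarse -> fine):
rng = np.random.default_rng(0)
levels = [rng.standard_normal((n, 32)) for n in (16, 64, 256)]
agg = levels[0]                    # start from the deepest (most semantic) level
for feat in levels[1:]:            # successively fold in finer levels
    agg = agg + cross_attention(agg, feat)   # residual cross-attention update
print(agg.shape)  # (16, 32)
```

Because the aggregated tokens stay at the coarse resolution while attending to finer levels, the cost grows linearly with the number of fine-level tokens rather than quadratically, which is the usual motivation for this style of decoder.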
- Conference Article
7
- 10.1117/12.2550857
- Mar 16, 2020
- Medical Imaging 2020: Computer-Aided Diagnosis
Weakly supervised disease classification of CT imaging suffers from poor localization owing to case-level annotations, where even a positive scan can hold hundreds to thousands of negative slices along multiple planes. Furthermore, although deep learning segmentation and classification models extract distinctly unique combinations of anatomical features from the same target class(es), they are typically seen as two independent processes in a computer-aided diagnosis (CAD) pipeline, with little to no feature reuse. In this research, we propose a medical classifier that leverages the semantic structural concepts learned via multi-resolution segmentation feature maps, to guide weakly supervised 3D classification of chest CT volumes. Additionally, a comparative analysis is drawn across two different types of feature aggregation to explore the vast possibilities surrounding feature fusion. Using a dataset of 1593 scans labeled on a case-level basis via a rule-based model, we train a dual-stage convolutional neural network (CNN) to perform organ segmentation and binary classification of four representative diseases (emphysema, pneumonia/atelectasis, mass and nodules) in lungs. The baseline model, with separate stages for segmentation and classification, results in an AUC of 0.791. Using identical hyperparameters, the connected architecture using static and dynamic feature aggregation improves performance to an AUC of 0.832 and 0.851, respectively. This study advances the field in two key ways. First, case-level report data is used to weakly supervise a 3D CT classifier of multiple, simultaneous diseases for an organ. Second, segmentation and classification models are connected with two different feature aggregation strategies to enhance the classification performance.
- Research Article
- 10.3390/e27080862
- Aug 14, 2025
- Entropy
Retinal vessel segmentation plays a crucial role in diagnosing various retinal and cardiovascular diseases and serves as a foundation for computer-aided diagnostic systems. Blood vessels in color retinal fundus images, captured using fundus cameras, are often affected by illumination variations and noise, making it difficult to preserve vascular integrity and posing a significant challenge for vessel segmentation. In this paper, we propose HM-Mamba, a novel hierarchical multi-scale Mamba-based architecture that incorporates tubular structure-aware convolution to extract both local and global vascular features for retinal vessel segmentation. First, we introduce a tubular structure-aware convolution to reinforce vessel continuity and integrity. Building on this, we design a multi-scale fusion module that aggregates features across varying receptive fields, enhancing the model’s robustness in representing both primary trunks and fine branches. Second, we integrate multi-branch Fourier transform with the dynamic state modeling capability of Mamba to capture both long-range dependencies and multi-frequency information. This design enables robust feature representation and adaptive fusion, thereby enhancing the network’s ability to model complex spatial patterns. Furthermore, we propose a hierarchical multi-scale interactive Mamba block that integrates multi-level encoder features through gated Mamba-based global context modeling and residual connections, enabling effective multi-scale semantic fusion and reducing detail loss during downsampling. Extensive evaluations on five widely used benchmark datasets—DRIVE, CHASE_DB1, STARE, IOSTAR, and LES-AV—demonstrate the superior performance of HM-Mamba, yielding Dice coefficients of 0.8327, 0.8197, 0.8239, 0.8307, and 0.8426, respectively.
- Research Article
19
- 10.1088/1361-6501/abfbfd
- May 26, 2021
- Measurement Science and Technology
Semantic segmentation of high-resolution remote sensing images has a wide range of applications, such as territorial planning, geographic monitoring and smart cities. The proper operation of semantic segmentation for remote sensing images remains challenging due to the complex and diverse transitions between different ground areas. Although several convolution neural networks (CNNs) have been developed for remote sensing semantic segmentation, the performance of CNNs is far from the expected target. This study presents a deep feature aggregation network (DFANet) for remote sensing image semantic segmentation. It is composed of a basic feature representation layer, an intermediate feature aggregation layer, a deep feature aggregation layer and a feature aggregation module (FAM). Specifically, the basic feature representation layer is used to obtain feature maps with different resolutions; the intermediate feature aggregation layer and deep feature aggregation layer can fuse various resolution features and multi-scale features; the FAM is used to splice the features and form more abundant spatial feature maps; and the conditional random field module is used to optimize semantic segmentation results. We have performed extensive experiments on the ISPRS two-dimensional Vaihingen and Potsdam remote sensing image datasets and compared the proposed method with several variations of semantic segmentation networks. The experimental results show that DFANet outperforms the other state-of-the-art approaches.
- Conference Article
150
- 10.1109/iccv.2019.00433
- Oct 1, 2019
Aggregating multi-level features is essential for capturing multi-scale context information for precise scene semantic segmentation. However, the improvement by directly fusing shallow features and deep features becomes limited as the semantic gap between them increases. To solve this problem, we explore two strategies for robust feature fusion. One is enhancing shallow features using a semantic enhancement module (SeEM) to alleviate the semantic gap between shallow features and deep features. The other strategy is feature attention, which involves discovering complementary information (i.e., boundary information) from low-level features to enhance high-level features for precise segmentation. By embedding these two strategies, we construct a parallel feature pyramid towards improving multi-level feature fusion. A Semantic Enhanced Network called SeENet is constructed with the parallel pyramid to implement precise segmentation. Experiments on three benchmark datasets demonstrate the effectiveness of our method for robust multi-level feature aggregation. As a result, our SeENet has achieved better performance than other state-of-the-art methods for semantic segmentation.
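A minimal sketch of the semantic-enhancement idea described above — adding projected, upsampled deep semantics to a shallow feature map to narrow the semantic gap. The 1x1-convolution-style projection and nearest-neighbour upsampling are illustrative choices, not SeENet's exact design.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def semantic_enhance(shallow, deep, proj):
    """Enhance a shallow feature map with projected deep semantics
    (an illustrative stand-in for the SeEM described above).
    shallow: (C, H, W); deep: (Cd, H/2, W/2); proj: (C, Cd)."""
    semantics = np.einsum('cd,dhw->chw', proj, deep)  # 1x1 conv as a matmul
    return shallow + upsample2x(semantics)            # residual enhancement

rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))   # low-level, high-resolution
deep = rng.standard_normal((32, 8, 8))       # high-level, low-resolution
proj = rng.standard_normal((8, 32)) * 0.1    # channel projection (hypothetical)
out = semantic_enhance(shallow, deep, proj)
print(out.shape)  # (8, 16, 16)
```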
- Research Article
4
- 10.1371/journal.pone.0297331
- Mar 11, 2024
- PLOS ONE
KRAS is a pathogenic gene frequently implicated in non-small cell lung cancer (NSCLC). However, biopsy as a diagnostic method has practical limitations. Therefore, it is important to accurately determine the mutation status of the KRAS gene non-invasively by combining NSCLC CT images and genetic data for early diagnosis and subsequent targeted therapy of patients. This paper proposes a Semi-supervised Multimodal Multiscale Attention Model (S2MMAM). S2MMAM comprises a Supervised Multilevel Fusion Segmentation Network (SMF-SN) and a Semi-supervised Multimodal Fusion Classification Network (S2MF-CN). S2MMAM facilitates the execution of the classification task by transferring the useful information captured in SMF-SN to the S2MF-CN to improve the model prediction accuracy. In SMF-SN, we propose a Triple Attention-guided Feature Aggregation module for obtaining segmentation features that incorporate high-level semantic abstract features and low-level semantic detail features. Segmentation features provide pre-guidance and key information expansion for S2MF-CN. S2MF-CN shares the encoder and decoder parameters of SMF-SN, which enables S2MF-CN to obtain rich classification features. S2MF-CN uses the proposed Intra and Inter Mutual Guidance Attention Fusion (I2MGAF) module to first guide segmentation and classification feature fusion to extract hidden multi-scale contextual information. I2MGAF then guides the multidimensional fusion of genetic data and CT image data to compensate for the lack of information in single modality data. S2MMAM achieved 83.27% AUC and 81.67% accuracy in predicting KRAS gene mutation status in NSCLC. This method uses medical image CT and genetic data to effectively improve the accuracy of predicting KRAS gene mutation status in NSCLC.
- Conference Article
- 10.1109/robio.2016.7866432
- Dec 1, 2016
In this paper, a novel semantic segmentation model based on aggregated features and contextual information is proposed. Given an RGB-D image, we train a support vector machine (SVM) to predict initial labels using aggregated features, and then optimize the predicted results using contextual information. For the aggregated features, local features on regions are extracted to capture the visual appearance of objects, and global features are exploited to represent scene information, so that the proposed model can utilize more discriminative features. For the contextual information, a novel multi-label conditional random field (CRF) model is constructed to jointly optimize the initial semantic and attribute predictions. Experimental results on the public NYU v2 dataset demonstrate that the proposed model outperforms existing state-of-the-art methods on a challenging 40-class task, yielding a higher mean IU accuracy of 33.7% and pixel average accuracy of 64.1%. In particular, the prediction accuracy of “small” classes has been improved significantly.
- Research Article
3
- 10.1109/access.2022.3190966
- Jan 1, 2022
- IEEE Access
Most graph convolutional neural networks process point clouds by constructing local graphs, increasing the number of channels with 1x1 convolutions, and using max pooling to aggregate features. However, there is no direct semantic information interaction between channels after features are aggregated. Moreover, only max pooling is used after graph construction, which loses most of the point features. Therefore, we propose a new method to enhance the local graph semantic features (EDGS) of point clouds. The method consists of a semantic feature interaction branch, a graph attention branch, and feature aggregation. We use the k-nearest neighbor algorithm to construct local graphs. After building the local graph, we use two branches to extract local features. In the first branch, max pooling is used to aggregate local graph semantic features. Then, the concept of grouping is adopted to guide the semantics of individual channels using the semantics of group channels, strengthening the feature interaction between channels. In the second branch, in order to preserve the features of points, we use graph attention to assign different weights on the local graph and a sum to aggregate the features between points. Finally, two learnable parameters are used to adaptively aggregate the local features of the two branches. Experimental results show that this method improves performance on the ModelNet40 and ShapeNetPart datasets, reaching 93.5% and 85.6%, respectively.
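The two-branch local-graph aggregation described above can be sketched in numpy as follows. The kNN construction, similarity-based attention weights, and fixed blending scalars (learnable parameters in the paper) are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knn_graph(points, k):
    """Indices of the k nearest neighbours of each point (excluding itself)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]            # (N, k)

def aggregate(features, idx, alpha=0.5, beta=0.5):
    """Two-branch local-graph aggregation: a max-pooling branch plus an
    attention-weighted sum branch, blended by scalars alpha/beta
    (learnable in the paper, fixed here)."""
    neigh = features[idx]                           # (N, k, d)
    branch_max = neigh.max(axis=1)                  # max-pool branch
    # attention from similarity between each point and its neighbours
    att = softmax((neigh * features[:, None, :]).sum(-1), axis=1)   # (N, k)
    branch_att = (att[:, :, None] * neigh).sum(axis=1)              # weighted sum
    return alpha * branch_max + beta * branch_att

rng = np.random.default_rng(0)
pts = rng.standard_normal((128, 3))      # xyz coordinates
feats = rng.standard_normal((128, 16))   # per-point features
out = aggregate(feats, knn_graph(pts, k=8))
print(out.shape)  # (128, 16)
```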
- Research Article
22
- 10.1109/access.2022.3163535
- Jan 1, 2022
- IEEE Access
For the semantic segmentation of remote sensing images (RSI), the trade-off between representation power and location accuracy is quite important. How to achieve this trade-off effectively is an open question, and current approaches that rely on very deep models result in complex networks with large memory consumption. In contrast to previous work that utilizes dilated convolutions or deep models, we propose a novel two-stream deep neural network for semantic segmentation of RSI (RSI-Net) that obtains improved performance by effectively modeling and propagating spatial contextual structure, together with a decoding scheme combining image-level and graph-level representations. The first component explicitly models correlations between adjacent land covers and conducts flexible convolution on arbitrarily irregular image regions using a graph convolutional network, while a densely connected atrous convolution network (DenseAtrousCNet) with multi-scale atrous convolutions expands the receptive fields and captures global image information. Extensive experiments are conducted on the Vaihingen, Potsdam and Gaofen RSI datasets, where the comparison results demonstrate the superior performance of RSI-Net in terms of overall accuracy (91.83%, 93.31% and 93.67% on the three datasets, respectively), F1 score (90.3%, 91.49% and 89.35%, respectively) and kappa coefficient (89.46%, 90.46% and 90.37%, respectively) when compared with six state-of-the-art RSI semantic segmentation methods.
- Research Article
56
- 10.1109/tmm.2020.2971175
- Dec 1, 2020
- IEEE Transactions on Multimedia
Pixel-level segmentation has been widely used to improve object detection. Most of the existing methods refine detection features by adding the constraint of the segmentation branch or by simply embedding high-level segmentation features into detection features within the local receptive field. However, noisy segmentation features are unavoidable in real-world applications and can easily cause false positives. To address this problem, we propose a novel hierarchical context embedding module to effectively embed segmentation features into detection features. The idea of this module is to capture hierarchical context information that includes local objects or parts and nonlocal context features by learning multiple attention maps, and subsequently utilize interdependencies between features to recalibrate noisy segmentation features. Furthermore, we use this module in the proposed gated encoder-decoder network that adaptively aggregates feature maps of different resolutions based on the gate mechanism so that we can embed multiscale segmentation feature maps into detection features for more accurate detection of objects of all sizes. Experimental results demonstrate the effectiveness of the proposed method on the Pascal VOC 2012Seg dataset, the Pascal VOC dataset and the MS COCO dataset.
- Research Article
2
- 10.17562/pb-47-5
- Jun 30, 2013
- Polibits
In this paper we use the statistics provided by a field experiment to explore the utility of supplying machine translation suggestions in a computer-assisted translation (CAT) environment. Regression models are trained for each user in order to estimate the time to edit (TTE) for the current translation segment. We use a combination of features from the current segment and aggregated features from formerly translated segments selected with content-based filtering approaches commonly used in recommendation systems. We present and evaluate decision function heuristics to determine if machine translation output will be useful for the translator in the given segment. We find that our regression models do a reasonable job for some users in predicting TTE given only a small number of training examples, although noise in the actual TTE for seemingly similar segments yields large error margins. We propose to include the estimation of TTE in CAT recommendation systems as a well-correlated metric for translation quality.
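A toy sketch of the per-user TTE regression described above, using a history-mean feature as a crude stand-in for the content-based filtering of previously translated segments. The feature layout and synthetic targets are assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

def tte_features(current, history):
    """Feature vector for one segment: the current segment's features plus
    the mean of features from previously translated segments (a stand-in
    for the content-based filtering described above)."""
    hist = history.mean(axis=0) if len(history) else np.zeros_like(current)
    return np.concatenate([current, hist])

# Fit a per-user least-squares model predicting time to edit (TTE)
# on synthetic data (targets are an exact linear function of the
# current-segment features, so the fit should be near-perfect).
rng = np.random.default_rng(1)
segs = rng.standard_normal((40, 4))                   # per-segment features
tte = segs @ np.array([2.0, -1.0, 0.5, 3.0]) + 10.0   # synthetic TTE targets
X = np.stack([tte_features(segs[i], segs[:i]) for i in range(len(segs))])
X = np.hstack([X, np.ones((len(X), 1))])              # bias term
w, *_ = np.linalg.lstsq(X, tte, rcond=None)
pred = X @ w
print(float(np.abs(pred - tte).mean()))  # near zero on this synthetic data
```

In practice the interesting part is the decision heuristic built on top of such a model (show the MT suggestion only when predicted TTE is below some threshold), which the abstract evaluates but which depends on per-user calibration.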
- Research Article
13
- 10.1109/lgrs.2021.3058427
- Feb 22, 2021
- IEEE Geoscience and Remote Sensing Letters
Remarkable improvements have been seen in the semantic segmentation of remote-sensing images. As an effective structure to aggregate shallow information and deep information, encoder–decoder structure has been widely used in many state-of-the-art models, but it possesses two drawbacks that have not been fully addressed. On the one hand, encoder–decoder structure fuses the features obtained from shallow and deep layers directly; despite harvesting some detailed information, it also brings in noisy features owing to the poor discriminant ability of the shallow layers. On the other hand, existing encoder–decoder structure merely fuses the high-level information generated by the last layer of encoder once, which neglects its guidance ability to the feature aggregation process in the decoder. In this letter, we first propose an edge perception module (EPM) to eliminate the noisy features in the shallow information, as well as enhance features’ structural information. And then, we generate the most suitable guidance information adaptively for different stages in the decoder through high-level information module (HIM). Finally, we apply the guidance information to achieve feature aggregation in the feature aggregation module (FAM). Combined with EPM, HIM, and FAM, our proposed model achieves 89.5% overall accuracy (OA) on the challenging ISPRS Vaihingen test set, which is the new state-of-the-art in the semantic segmentation of remote-sensing images.
- Research Article
6
- 10.3390/electronics12030680
- Jan 29, 2023
- Electronics
Video salient object detection has attracted growing interest in recent years. However, some existing video saliency models often suffer from the inappropriate utilization of spatial and temporal cues and the insufficient aggregation of different level features, leading to remarkable performance degradation. Therefore, we propose a quality-driven dual-branch feature integration network majoring in the adaptive fusion of multi-modal cues and sufficient aggregation of multi-level spatiotemporal features. Firstly, we employ the quality-driven multi-modal feature fusion (QMFF) module to combine the spatial and temporal features. Particularly, the quality scores estimated from each level’s spatial and temporal cues are not only used to weigh the two modal features but also to adaptively integrate the coarse spatial and temporal saliency predictions into the guidance map, which further enhances the two modal features. Secondly, we deploy the dual-branch-based multi-level feature aggregation (DMFA) module to integrate multi-level spatiotemporal features, where the two branches including the progressive decoder branch and the direct concatenation branch sufficiently explore the cooperation of multi-level spatiotemporal features. In particular, in order to provide an adaptive fusion for the outputs of the two branches, we design the dual-branch fusion (DF) unit, where the channel weight of each output can be learned jointly from the two outputs. The experiments conducted on four video datasets clearly demonstrate the effectiveness and superiority of our model against the state-of-the-art video saliency models.
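The quality-driven weighting of spatial and temporal cues can be sketched as below; the scalar quality scores and normalization are a simplified, hypothetical stand-in for the QMFF module's learned estimates.

```python
import numpy as np

def quality_fusion(spatial, temporal, q_s, q_t):
    """Blend two modal feature maps by their (hypothetical) quality
    scores: the higher-quality modality gets the larger weight."""
    w = q_s / (q_s + q_t + 1e-8)        # normalized spatial weight
    return w * spatial + (1.0 - w) * temporal

spatial = np.full((4, 4), 2.0)          # toy spatial features
temporal = np.full((4, 4), 6.0)         # toy temporal features
fused = quality_fusion(spatial, temporal, q_s=3.0, q_t=1.0)
print(fused[0, 0])  # 0.75 * 2 + 0.25 * 6 = 3.0
```

The same normalized scores can also gate a coarse saliency guidance map, which is how the abstract describes the quality estimates being reused.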
- Research Article
10
- 10.1016/j.eswa.2023.119647
- Feb 1, 2023
- Expert Systems With Applications
Rethinking text rectification for scene text recognition
- Book Chapter
17
- 10.1007/978-3-030-59710-8_34
- Jan 1, 2020
Aggregating multi-level feature representations plays a critical role in achieving robust volumetric medical image segmentation, which is important for auxiliary diagnosis and treatment. Unlike recent neural architecture search (NAS) methods, which typically search for the optimal operators in each network layer but lack a good strategy for searching feature aggregations, this paper proposes a novel NAS method for 3D medical image segmentation, named UXNet, which searches both the scale-wise feature aggregation strategies and the block-wise operators in the encoder-decoder network. UXNet has several appealing benefits. (1) It significantly improves the flexibility of the classical UNet architecture, which only aggregates feature representations of the encoder and decoder at equivalent resolutions. (2) A continuous relaxation of UXNet is carefully designed, enabling its searching scheme to be performed in an efficient differentiable manner. (3) Extensive experiments demonstrate the effectiveness of UXNet compared with recent NAS methods for medical image segmentation. The architecture discovered by UXNet outperforms existing state-of-the-art models in terms of Dice on several public 3D medical image segmentation benchmarks, especially at boundary locations and for tiny tissues. The searching cost of UXNet is low, enabling it to find the best-performing network in less than 1.5 days on two TitanXP GPUs.
- Conference Article
15
- 10.1109/ictai.2017.00172
- Nov 1, 2017
In the fields of ADAS and self-driving cars, lane and drivable road detection play an essential role in reliably accomplishing other tasks, such as object detection. For monocular-vision-based semantic segmentation of lanes and roads, we propose a dilated feature pyramid network (FPN) with feature aggregation, called DFFA, where feature aggregation is employed to combine multi-level features enhanced with dilated convolution operations and an FPN under the ResNet framework. Experimental results validate the effectiveness and efficiency of the proposed deep learning model for semantic segmentation of lanes and drivable roads. Our DFFA achieves the best performance on both the Lane Estimation Evaluation and Behavior Evaluation tasks in KITTI-ROAD and takes second place on the UU ROAD task.