RTS-ViT: Real-Time Share Vision Transformer for Image Classification.

Abstract

Vision transformers have achieved remarkable success in image classification, and dual-branch vision transformers generate richer features by exploiting feature fusion. Inspired by this, a dual-branch vision transformer that shares features in real time during encoding (RTS-ViT) is proposed for retinal image classification. The approach processes image patches of two sizes (base and large) through two independent branches and performs multi-stage real-time feature fusion via the Real-Time Share feature encoder. This encoder lets the branches complement each other's features at every encoding stage, which supports finer feature learning and enriches the self-attention information passed to subsequent stages, improving feature representation and classification performance. In addition, a simple and effective fusion method, L-Times Attention Fusion, is proposed: vector concatenation is used for the Real-Time Share feature in the first L-1 encoding stages, and element-wise addition is used for overall feature fusion at the L-th stage, giving more efficient feature integration. The method was validated on a retinal image dataset. Results show that, without relying on pre-trained weights, the approach exceeds the average top-1 accuracy of the recent Cross-ViT by 5.61% with fewer FLOPs and model parameters, indicating stronger self-learned feature capability and reduced reliance on extensive pre-training data.
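
The fusion rule named in the abstract (concatenation in the first L-1 stages, element-wise addition at stage L) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_features`, the use of plain Python lists as feature vectors, and the 1-based stage index are all hypothetical, and the abstract does not say which tokens are shared or how the concatenated vector is projected before the next stage.

```python
def fuse_features(base_vec, large_vec, stage, num_stages):
    """Fuse one feature vector from each branch at a given encoding stage.

    Assumption: stages are numbered 1..num_stages (num_stages plays the
    role of L in the abstract). Any projection applied after concatenation
    is outside the scope of this sketch.
    """
    if not 1 <= stage <= num_stages:
        raise ValueError("stage must lie in 1..num_stages")

    if stage < num_stages:
        # Earlier (L-1) stages: vector concatenation of the two branches.
        return list(base_vec) + list(large_vec)

    # Final (L-th) stage: element-wise addition, which needs equal lengths.
    if len(base_vec) != len(large_vec):
        raise ValueError("element-wise addition needs equal-length vectors")
    return [b + l for b, l in zip(base_vec, large_vec)]


# Hypothetical two-dimensional features from the base and large branches.
base = [1.0, 2.0]
large = [3.0, 4.0]
early = fuse_features(base, large, stage=1, num_stages=3)   # concatenation
final = fuse_features(base, large, stage=3, num_stages=3)   # addition
```

Concatenation keeps both branches' information separate at the cost of doubling the vector length, while addition keeps the length fixed, which may be one reason the final stage uses it.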

Similar Papers
  • Research Article
  • Citations: 24
  • 10.1167/tvst.13.2.16
Deep Learning and Machine Learning Algorithms for Retinal Image Analysis in Neurodegenerative Disease: Systematic Review of Datasets and Models.
  • Feb 21, 2024
  • Translational Vision Science & Technology
  • Tyler Bahr + 3 more

Retinal images contain rich biomarker information for neurodegenerative disease. Recently, deep learning models have been used for automated neurodegenerative disease diagnosis and risk prediction using retinal images with good results. In this review, we systematically report studies with datasets of retinal images from patients with neurodegenerative diseases, including Alzheimer's disease, Huntington's disease, Parkinson's disease, amyotrophic lateral sclerosis, and others. We also review and characterize the models in the current literature which have been used for classification, regression, or segmentation problems using retinal images in patients with neurodegenerative diseases. Our review found several existing datasets and models with various imaging modalities primarily in patients with Alzheimer's disease, with most datasets on the order of tens to a few hundred images. We found limited data available for the other neurodegenerative diseases. Although cross-sectional imaging data for Alzheimer's disease is becoming more abundant, datasets with longitudinal imaging of any disease are lacking. The use of bilateral and multimodal imaging together with metadata seems to improve model performance, thus multimodal bilateral image datasets with patient metadata are needed. We identified several deep learning tools that have been useful in this context including feature extraction algorithms specifically for retinal images, retinal image preprocessing techniques, transfer learning, feature fusion, and attention mapping. Importantly, we also consider the limitations common to these models in real-world clinical applications. This systematic review evaluates the deep learning models and retinal features relevant in the evaluation of retinal images of patients with neurodegenerative disease.

  • Conference Article
  • Citations: 3
  • 10.1109/igarss47720.2021.9554465
Subspace-Based Feature Fusion from Hyperspectral and Multispectral Images for Land Cover Classification
  • Jul 11, 2021
  • Juan Ramirez + 3 more

In remote sensing, hyperspectral (HS) and multispectral (MS) image fusion has emerged as a synthesis tool to improve the data set resolution. However, conventional image fusion methods typically degrade the performance of the land cover classification. In this paper, a feature fusion method from HS and MS images for pixel-based classification is proposed. More precisely, the proposed method first extracts spatial features from the MS image using morphological profiles. Then, the feature fusion model assumes that both the extracted morphological profiles and the HS image can be described as a feature matrix lying in different subspaces. An algorithm based on combining alternating optimization (AO) and the alternating direction method of multipliers (ADMM) is developed to efficiently solve the feature fusion problem. Finally, extensive simulations were run to evaluate the performance of the proposed feature fusion approach for two data sets. In general, the proposed approach exhibits a competitive performance compared to other feature extraction methods.

  • Research Article
  • Citations: 5
  • 10.3390/rs13234823
Attention-Guided Multispectral and Panchromatic Image Classification
  • Nov 27, 2021
  • Remote Sensing
  • Cheng Shi + 4 more

Multi-sensor images can provide supplementary information, usually leading to better performance in classification tasks. However, general deep neural network-based multi-sensor classification methods learn each sensor image separately, followed by a stacked concatenation for feature fusion. This approach incurs a large time cost for network training and may result in insufficient feature fusion. Considering efficient multi-sensor feature extraction and fusion with a lightweight network, this paper proposes an attention-guided classification method (AGCNet), especially for multispectral (MS) and panchromatic (PAN) image classification. In the proposed method, a share-split network (SSNet) including a shared branch and multiple split branches performs feature extraction for each sensor image, where the shared branch learns basis features of MS and PAN images with fewer learnable parameters, and the split branch extracts the privileged features of each sensor image via multiple task-specific attention units. Furthermore, a selective classification network (SCNet) with a selective kernel unit is used for adaptive feature fusion. The proposed AGCNet can be trained in an end-to-end fashion without manual intervention. The experimental results are reported on four MS and PAN datasets, and compared with state-of-the-art methods. The classification maps and accuracies show the superiority of the proposed AGCNet model.

  • Research Article
  • Citations: 37
  • 10.1109/tgrs.2022.3179288
A Shallow-to-Deep Feature Fusion Network for VHR Remote Sensing Image Classification
  • Jan 1, 2022
  • IEEE Transactions on Geoscience and Remote Sensing
  • Sicong Liu + 7 more

With more detailed spatial information being represented in very-high-resolution (VHR) remote sensing images, stringent requirements are imposed on accurate image classification. Due to the diverse land-objects with intraclass variation and interclass similarity, efficient and fine classification of VHR images, especially in complex scenes, is challenging. Even for some popular deep learning (DL) frameworks, geometric details of land-objects may be lost in deep feature levels, so it is difficult to maintain the highly-detailed spatial information (e.g., edges, small objects) by relying only on the last high-level layer. Moreover, many of the newly developed DL methods require massive well-labeled samples, which inevitably deteriorates the model generalization ability under few-shot learning. Therefore, in this paper, a lightweight shallow-to-deep feature fusion network (SDF2N) is proposed for VHR image classification, where the traditional machine learning (ML) and DL schemes are integrated to learn rich and representative information to improve the classification accuracy. In particular, the shallow spectral-spatial features are first extracted, and then a novel triple-stage fusion (TSF) module is designed to learn the saliency and discriminative information at different levels for classification. The TSF module includes three feature fusion stages, i.e., low-level spectral-spatial feature fusion, middle-level multi-scale feature fusion, and high-level multi-layer feature fusion. The proposed SDF2N takes advantage of the shallow-to-deep features, which can extract representative and complementary information across layers. It is important to note that even with limited training samples, the SDF2N can still achieve satisfying classification performance. Experimental results obtained on three real VHR remote sensing data sets, including two multispectral images and one airborne hyperspectral image covering complex urban scenarios, confirm the effectiveness of the proposed approach compared with the state-of-the-art methods.

  • Research Article
  • Citations: 3
  • 10.1038/s41598-024-67121-7
Selection of pre-trained weights for transfer learning in automated cytomegalovirus retinitis classification
  • Jul 10, 2024
  • Scientific Reports
  • Pitipol Choopong + 1 more

Cytomegalovirus retinitis (CMVR) is a significant cause of vision loss. Regular screening is crucial but challenging in resource-limited settings. A convolutional neural network is a state-of-the-art deep learning technique to generate automatic diagnoses from retinal images. However, there are limited numbers of CMVR images to train the model properly. Transfer learning (TL) is a strategy to train a model with a scarce dataset. This study explores the efficacy of TL with different pre-trained weights for automated CMVR classification using retinal images. We utilised a dataset of 955 retinal images (524 CMVR and 431 normal) from Siriraj Hospital, Mahidol University, collected between 2005 and 2015. Images were processed using Kowa VX-10i or VX-20 fundus cameras and augmented for training. We employed DenseNet121 as a backbone model, comparing the performance of TL with weights pre-trained on ImageNet, APTOS2019, and CheXNet datasets. The models were evaluated based on accuracy, loss, and other performance metrics, with the depth of fine-tuning varied across different pre-trained weights. The study found that TL significantly enhances model performance in CMVR classification. The best results were achieved with weights sequentially transferred from ImageNet to APTOS2019 dataset before application to our CMVR dataset. This approach yielded the highest mean accuracy (0.99) and lowest mean loss (0.04), outperforming other methods. The class activation heatmaps provided insights into the model's decision-making process. The model with APTOS2019 pre-trained weights offered the best explanation and highlighted the pathologic lesions resembling human interpretation. Our findings demonstrate the potential of sequential TL in improving the accuracy and efficiency of CMVR diagnosis, particularly in settings with limited data availability. They highlight the importance of domain-specific pre-training in medical image classification. This approach streamlines the diagnostic process and paves the way for broader applications in automated medical image analysis, offering a scalable solution for early disease detection.

  • Research Article
  • Citations: 32
  • 10.3390/rs13224621
A Novel 2D-3D CNN with Spectral-Spatial Multi-Scale Feature Fusion for Hyperspectral Image Classification
  • Nov 17, 2021
  • Remote Sensing
  • Dongxu Liu + 6 more

Numerous hyperspectral image (HSI) classification methods based on convolutional neural networks (CNN) have been proposed and achieve promising classification performance. However, hyperspectral image classification still suffers from various challenges, including abundant redundant information, insufficient spectral-spatial representation, irregular class distribution, and so forth. To address these issues, we propose a novel 2D-3D CNN with spectral-spatial multi-scale feature fusion for hyperspectral image classification, which consists of two feature extraction streams, a feature fusion module, and a classification scheme. First, we employ two different backbone modules for feature representation, that is, the spectral feature and the spatial feature extraction streams. The former utilizes a hierarchical feature extraction module to capture multi-scale spectral features, while the latter extracts multi-stage spatial features by introducing a multi-level fusion structure. With these network units, the category attribute information of HSI can be fully exploited. Then, to output more complete and robust information for classification, a multi-scale spectral-spatial-semantic feature fusion module is presented based on a Decomposition-Reconstruction structure. Finally, we introduce a classification scheme to improve the classification accuracy. Experimental results on three public datasets demonstrate that the proposed method outperforms the state-of-the-art methods.

  • Research Article
  • Citations: 9
  • 10.1016/j.neucom.2016.09.129
Exploiting score distribution for heterogenous feature fusion in image classification
  • Mar 8, 2017
  • Neurocomputing
  • Chengkun He + 4 more

  • Research Article
  • Citations: 1
  • 10.1038/s41598-025-18329-8
Multi-stage fusion of local and global features for few-shot image classification.
  • Sep 29, 2025
  • Scientific Reports
  • Yi Gu + 2 more

Few-shot classification is a very challenging task in computer vision. Recently, in contrast to meta-learning, transfer learning that forgoes the episodic training strategy has gradually become popular in this community. Under this pipeline, learning a high-quality feature representation is vital for achieving good performance. However, current works mainly build the classification model upon convolutional neural networks, which cannot extract discriminative features. To address this problem, we propose exploring non-local networks to construct the classification model, which is trained by the joint learning of supervised and self-supervised tasks to obtain global invariant features. Further, we propose a few-shot classification algorithm using multi-stage fusion of local and global features, in which the fusion of features happens during both stages of transfer learning. The pre-training stage implements a parallel mechanism, in which the local feature network and global feature network mutually learn from each other, while the few-shot testing stage implements a serial mechanism through feature concatenation. We conducted extensive evaluations on multiple benchmark datasets to demonstrate the effectiveness of our method. Ablation studies have shown the effectiveness of the multi-stage feature fusion, and the comparison results have shown that our method can achieve better performance compared with other state-of-the-art methods.

  • Conference Article
  • Citations: 3
  • 10.1117/12.2509312
VinceptionC3D: a 3D convolutional neural network for retinal OCT image classification
  • Jul 30, 2019
  • Shuanglang Feng + 5 more

To enable further and more accurate automatic analysis and processing of optical coherence tomography (OCT) images, such as layer segmentation, disease region segmentation, and registration, it is necessary to screen OCT images first. In this paper, we propose an efficient multi-class 3D retinal OCT image classification network named VinceptionC3D. VinceptionC3D is a 3D convolutional neural network which is improved from basic C3D by adding improved 3D inception modules. Our main contributions are: (1) demonstrating that a fine-tuned C3D which is pretrained on nature action video datasets can be applied for the classification of 3D retinal OCT images; (2) improving the network by employing a 3D inception module which can capture multi-scale features. The proposed method is trained and tested on 873 3D OCT images with 6 classes. The average accuracies of the C3D with random initialization weights, the C3D with pre-trained weights, and the proposed VinceptionC3D with pre-trained weights are 89.35%, 92.09% and 94.04%, respectively. The result shows that the proposed VinceptionC3D is effective for the 6-class 3D retinal OCT image classification.

  • Research Article
  • Citations: 9
  • 10.2174/1573405620666230328092218
Automated Brain Tumour Detection and Classification using Deep Features and Bayesian Optimised Classifiers.
  • Jul 11, 2023
  • Current Medical Imaging Reviews
  • S Arun Kumar + 1 more

Brain tumour detection and classification require trained radiologists for efficient diagnosis. The proposed work aims to build a Computer Aided Diagnosis (CAD) tool to automate brain tumour detection using Machine Learning (ML) and Deep Learning (DL) techniques. Magnetic Resonance Imaging (MRI) data collected from the publicly available Kaggle dataset are used for brain tumour detection and classification. Deep features extracted from the global pooling layer of a pretrained Resnet18 network are classified using 3 different ML classifiers: Support Vector Machine (SVM), K-Nearest Neighbour (KNN), and Decision Tree (DT). The hyperparameters of these classifiers are further optimised using a Bayesian Algorithm (BA) to enhance the performance. Fusion of features extracted from shallow and deep layers of the pretrained Resnet18 network, followed by BA-optimised ML classifiers, is further used to enhance the detection and classification performance. The confusion matrix derived from the classifier model is used to evaluate the system's performance. Evaluation metrics, such as accuracy, sensitivity, specificity, precision, F1 score, Balance Classification Rate (BCR), Matthews Correlation Coefficient (MCC) and Kappa Coefficient (Kp), are calculated. Maximum accuracy, sensitivity, specificity, precision, F1 score, BCR, MCC, and Kp of 99.11 %, 98.99 %, 99.22 %, 99.09 %, 99.09 %, 99.10 %, 98.21 %, 98.21 %, respectively, were obtained for detection using fusion of shallow and deep features of the pretrained Resnet18 network classified by the BA-optimised SVM classifier. Feature fusion performs better for the classification task, with accuracy, sensitivity, specificity, precision, F1 score, BCR, MCC and Kp of 97.31 %, 97.30 %, 98.65 %, 97.37 %, 97.34 %, 97.97 %, 95.99 %, 93.95 %, respectively. The proposed brain tumour detection and classification framework, using deep feature extraction from the pretrained Resnet18 network in conjunction with feature fusion and optimised ML classifiers, can improve the system performance. Hence, the proposed work can be used as an assistive tool to aid the radiologist in automated brain tumour analysis and treatment.
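
The metrics listed in this abstract can all be derived from a confusion matrix. The sketch below covers the binary (detection) case only and is an illustration rather than the paper's evaluation code: the function name `binary_metrics` and the example counts are hypothetical, and BCR is assumed here to mean the average of sensitivity and specificity, which the abstract does not define.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from binary confusion-matrix counts.

    Assumes tp > 0 and that both classes occur in the labels and in the
    predictions, so that no denominator below is zero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    bcr = (sensitivity + specificity) / 2  # assumed definition of BCR

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den

    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - chance) / (1 - chance)

    return {
        "accuracy": accuracy, "sensitivity": sensitivity,
        "specificity": specificity, "precision": precision,
        "f1": f1, "bcr": bcr, "mcc": mcc, "kappa": kappa,
    }


# Made-up counts for illustration; they are not taken from the paper.
metrics = binary_metrics(tp=50, fp=5, fn=10, tn=35)
```

For the multi-class classification task, the same quantities would typically be computed per class and then averaged, and the averaging scheme is a further choice that the abstract does not specify.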

  • Research Article
  • Citations: 11
  • 10.3390/curroncol30010042
A Deep Learning Workflow for Mass-Forming Intrahepatic Cholangiocarcinoma and Hepatocellular Carcinoma Classification Based on MRI
  • Dec 30, 2022
  • Current Oncology
  • Yangling Liu + 5 more

Objective: Precise classification of mass-forming intrahepatic cholangiocarcinoma (MF-ICC) and hepatocellular carcinoma (HCC) based on magnetic resonance imaging (MRI) is crucial for personalized treatment strategy. The purpose of the present study was to differentiate MF-ICC from HCC applying a novel deep-learning-based workflow with stronger feature extraction ability and fusion capability to improve the classification performance of deep learning on small datasets. Methods: To retain more effective lesion features, we propose a preprocessing method called semi-segmented preprocessing (Semi-SP) to select the region of interest (ROI). Then, the ROIs were sent to the strided feature fusion residual network (SFFNet) for training and classification. The SFFNet model is composed of three parts: the multilayer feature fusion module (MFF) was proposed to extract discriminative features of MF-ICC/HCC and integrate features of different levels; a new stationary residual block (SRB) was proposed to solve the problem of information loss and network instability during training; the attention mechanism convolutional block attention module (CBAM) was adopted in the middle layer of the network to extract the correlation of multi-spatial feature information, so as to filter the irrelevant feature information in pixels. Results: The SFFNet model achieved an overall accuracy of 92.26% and an AUC of 0.9680, with high sensitivity (86.21%) and specificity (94.70%) for MF-ICC. Conclusion: In this paper, we proposed a specifically designed Semi-SP method and SFFNet model to differentiate MF-ICC from HCC. This workflow achieves good MF-ICC/HCC classification performance due to stronger feature extraction and fusion capabilities, which provide complementary information for personalized treatment strategy.

  • Research Article
  • Citations: 47
  • 10.1080/09540091.2021.1875987
Adaptive weights learning in CNN feature fusion for crime scene investigation image classification
  • Jan 22, 2021
  • Connection Science
  • Liu Ying + 8 more

The combination of features from the convolutional layer and the fully connected layer of a convolutional neural network (CNN) provides an effective way to improve the performance of crime scene investigation (CSI) image classification. However, in existing work, the weights used in feature fusion do not change after the training phase, which may produce inaccurate image features that affect classification results. To solve this problem, this paper proposes an adaptive feature fusion method based on an auto-encoder to improve classification accuracy. The method includes the following steps. First, the CNN model is trained by transfer learning. Next, the features of the convolutional layer and the fully connected layer are extracted separately. These extracted features are then passed into the auto-encoder for further learning with Softmax normalisation to obtain the adaptive weights for performing the final classification. Experiments demonstrated that the proposed method achieves higher CSI image classification performance than fixed-weight feature fusion.
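
The idea of Softmax-normalised fusion weights can be illustrated with a short sketch. It is not the paper's method: how the auto-encoder produces the per-feature scores is not described in the abstract, so the scores are treated as given inputs, the names `softmax` and `fuse_with_adaptive_weights` are hypothetical, and both feature vectors are assumed to have already been brought to the same length.

```python
import math


def softmax(scores):
    """Normalise raw scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_with_adaptive_weights(features, scores):
    """Weighted sum of equal-length feature vectors.

    `scores` stands in for whatever a learned module would output, one
    score per feature vector (an assumption of this sketch).
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")

    weights = softmax(scores)
    return [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(dim)
    ]


# Hypothetical convolutional-layer and fully-connected-layer features.
conv_feat = [2.0, 4.0]
fc_feat = [4.0, 8.0]
fused = fuse_with_adaptive_weights([conv_feat, fc_feat], scores=[0.0, 0.0])
```

With equal scores the weights are equal and the result is a plain average; a fixed-weight scheme corresponds to freezing `scores`, whereas an adaptive scheme recomputes them per input.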

  • Research Article
  • Citations: 4
  • 10.3390/jimaging11040123
Evolutionary-Driven Convolutional Deep Belief Network for the Classification of Macular Edema in Retinal Fundus Images.
  • Apr 21, 2025
  • Journal of Imaging
  • Rafael A García-Ramírez + 4 more

Early detection of diabetic retinopathy is critical for preserving vision in diabetic patients. The classification of lesions in retinal fundus images, particularly macular edema, is an essential diagnostic tool, yet it presents a significant learning curve for both novice and experienced ophthalmologists. To address this challenge, a novel Convolutional Deep Belief Network (CDBN) is proposed to classify image patches into three distinct categories: two types of macular edema (microhemorrhages and hard exudates) and a healthy category. The method leverages high-level feature extraction to mitigate issues arising from the high similarity of low-level features in noisy images. Additionally, a Real-Coded Genetic Algorithm optimizes the parameters of Gabor filters and the network, ensuring optimal feature extraction and classification performance. Experimental results demonstrate that the proposed CDBN outperforms comparative models, achieving an F1 score of 0.9258. These results indicate that the architecture effectively overcomes the challenges of lesion classification in retinal images, offering a robust tool for clinical application and paving the way for advanced clinical decision support systems in diabetic retinopathy management.

  • Book Chapter
  • Citations: 9
  • 10.1007/978-3-030-40605-9_23
Evaluation of Unconditioned Deep Generative Synthesis of Retinal Images
  • Jan 1, 2020
  • Sinan Kaplan + 3 more

Retinal images have become increasingly important in clinical diagnostics of several eye and systemic diseases. To help medical doctors in this work, automatic and semi-automatic diagnosis methods can be used to increase the efficiency of diagnostic and follow-up processes, as well as to enable wider disease screening programs. However, the training of advanced machine learning methods for improved retinal image analysis typically requires large and representative retinal image data sets. Even when large data sets of retinal images are available, the occurrence of different medical conditions is unbalanced in them. Hence, there is a need to enrich the existing data sets by data augmentation and by introducing noise, which is essential to build robust and reliable machine learning models. One way to overcome these shortcomings relies on generative models for synthesizing images. To study the limits of retinal image synthesis, this paper focuses on deep generative models, including a generative adversarial network and a variational autoencoder, to synthesize images from noise without conditioning on any information regarding the retina. The models are trained with the Kaggle EyePACS retinal image set, and for quantifying the image quality in a no-reference manner, the generated images are compared with the retinal images of the DiaRetDB1 database using common similarity metrics.

  • Conference Article
  • Citations: 4
  • 10.1109/ncetstea48365.2020.9119931
Automatic Detection and Segmentation of Optic Disc (ADSO) of Retinal Fundus Images Based on Mathematical Morphology
  • Feb 1, 2020
  • Niladri Halder + 5 more

The main objective of the medical image processing field is to design computational tools that assist in the quantification and visualization of notable pathology and anatomical structures. Diabetic retinopathy is a medical disorder in which the retina is damaged because fluid leaks from the blood vessels into the retina of the human eye. The identification of the optic disc in retinal fundus images and the quantitative study of the evolution of its shape and size play an important role in diagnosing different pathologies and abnormalities related to the retina. Most abnormalities related to the optic disc may lead to structural changes in the inner and outer areas of the optic disc. Optic disc identification and segmentation at the level of the whole retinal image reduces the detection sensitivity for those parts. In this research, an advanced classification based on a hierarchical process for the detection and segmentation of the optic disc is proposed. The exact boundary of the optic disc is obtained by calculating the region of interest and applying an innovative adaptive thresholding based on morphological transformation. The presented technique helps to reduce the area that segmentation techniques must process, leading to a notable performance enhancement and a lower computational cost for each retinal fundus image. The proposed technique has been evaluated on publicly available data sets of retinal images, namely DIARETDB1, DRIVE, HRF, DRIONS-DB, IDRiD and STARE, and a remarkable improvement over the existing techniques has been found in terms of accuracy and processing time.
