RTS-ViT: Real-Time Share Vision Transformer for Image Classification.
Vision transformers have achieved remarkable success in image classification. Dual-branch vision transformers generate richer features by taking advantage of feature fusion. Inspired by this, a dual-branch vision transformer that shares features between branches in real time during encoding (Real-Time Share) is proposed for retinal image classification. The approach processes image patches of different sizes (base and large) through two independent branches and performs multi-stage real-time feature fusion via the Real-Time Share feature encoder. This encoder enables the branches to complement each other's features at each encoding stage, facilitating finer feature learning and enhancing the self-attention information passed to subsequent stages, which significantly boosts feature representation and classification performance. Additionally, a straightforward and effective feature fusion method, L-Times Attention Fusion, is proposed: vector concatenation is used for Real-Time Share features in the first L-1 encoding stages, and element-wise addition is used for overall feature fusion at the L-th stage, achieving more efficient feature integration. The method was validated on a retinal image dataset. Results show that the approach exceeds the average top-1 accuracy of the recent Cross-ViT by 5.61% with fewer FLOPs and model parameters, without relying on pre-trained weights, highlighting a stronger ability to learn features from the data itself and reduced reliance on extensive pre-training data.
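The two fusion operators named in L-Times Attention Fusion can be illustrated with a minimal sketch. It only shows which operator applies at which stage (concatenation for stages 1 to L-1, element-wise addition at stage L). It does not model the encoder, the attention layers, or how shared features are fed back into the branches, none of which the abstract specifies. The function name `l_times_fusion` and the plain-list feature vectors are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def concat(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Vector concatenation: the result has len(a) + len(b) entries."""
    return list(a) + list(b)


def add(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise addition: both vectors must have the same length."""
    if len(a) != len(b):
        raise ValueError("element-wise addition needs equal-length vectors")
    return [x + y for x, y in zip(a, b)]


def l_times_fusion(
    base_stages: Sequence[Sequence[float]],
    large_stages: Sequence[Sequence[float]],
) -> Tuple[List[Vector], Vector]:
    """Hypothetical helper: fuse per-stage features of two branches.

    base_stages[i] and large_stages[i] are the feature vectors that the
    base-patch and large-patch branches produce at encoding stage i + 1.
    Returns the concatenated features for stages 1..L-1 and the
    element-wise sum for stage L.
    """
    if len(base_stages) != len(large_stages) or not base_stages:
        raise ValueError("both branches need the same, non-zero stage count")
    num_stages = len(base_stages)
    shared = [
        concat(base_stages[i], large_stages[i]) for i in range(num_stages - 1)
    ]
    final = add(base_stages[-1], large_stages[-1])
    return shared, final


if __name__ == "__main__":
    base = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    large = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]
    shared_feats, fused = l_times_fusion(base, large)
    print(shared_feats)
    print(fused)
```

Concatenation keeps each branch's features separate, at the cost of a wider vector. Addition keeps the width fixed but requires both branches to produce vectors of equal length at the final stage.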
- Research Article
24
- 10.1167/tvst.13.2.16
- Feb 21, 2024
- Translational Vision Science & Technology
Retinal images contain rich biomarker information for neurodegenerative disease. Recently, deep learning models have been used for automated neurodegenerative disease diagnosis and risk prediction using retinal images with good results. In this review, we systematically report studies with datasets of retinal images from patients with neurodegenerative diseases, including Alzheimer's disease, Huntington's disease, Parkinson's disease, amyotrophic lateral sclerosis, and others. We also review and characterize the models in the current literature which have been used for classification, regression, or segmentation problems using retinal images in patients with neurodegenerative diseases. Our review found several existing datasets and models with various imaging modalities primarily in patients with Alzheimer's disease, with most datasets on the order of tens to a few hundred images. We found limited data available for the other neurodegenerative diseases. Although cross-sectional imaging data for Alzheimer's disease is becoming more abundant, datasets with longitudinal imaging of any disease are lacking. The use of bilateral and multimodal imaging together with metadata seems to improve model performance, thus multimodal bilateral image datasets with patient metadata are needed. We identified several deep learning tools that have been useful in this context including feature extraction algorithms specifically for retinal images, retinal image preprocessing techniques, transfer learning, feature fusion, and attention mapping. Importantly, we also consider the limitations common to these models in real-world clinical applications. This systematic review evaluates the deep learning models and retinal features relevant in the evaluation of retinal images of patients with neurodegenerative disease.
- Conference Article
3
- 10.1109/igarss47720.2021.9554465
- Jul 11, 2021
In remote sensing, hyperspectral (HS) and multispectral (MS) image fusion has emerged as a synthesis tool to improve data set resolution. However, conventional image fusion methods typically degrade land cover classification performance. In this paper, a feature fusion method from HS and MS images for pixel-based classification is proposed. More precisely, the proposed method first extracts spatial features from the MS image using morphological profiles. Then, the feature fusion model assumes that both the extracted morphological profiles and the HS image can be described as a feature matrix lying in different subspaces. An algorithm combining alternating optimization (AO) and the alternating direction method of multipliers (ADMM) is developed to solve the feature fusion problem efficiently. Finally, extensive simulations were run to evaluate the performance of the proposed feature fusion approach on two data sets. In general, the proposed approach exhibits competitive performance compared to other feature extraction methods.
- Research Article
5
- 10.3390/rs13234823
- Nov 27, 2021
- Remote Sensing
Multi-sensor images can provide supplementary information, usually leading to better performance in classification tasks. However, general deep neural network-based multi-sensor classification methods learn each sensor image separately and then stack and concatenate the results for feature fusion. This requires a large time cost for network training and may lead to insufficient feature fusion. To achieve efficient multi-sensor feature extraction and fusion with a lightweight network, this paper proposes an attention-guided classification method (AGCNet), especially for multispectral (MS) and panchromatic (PAN) image classification. In the proposed method, a share-split network (SSNet) including a shared branch and multiple split branches performs feature extraction for each sensor image, where the shared branch learns basis features of MS and PAN images with fewer learnable parameters, and the split branch extracts the privileged features of each sensor image via multiple task-specific attention units. Furthermore, a selective classification network (SCNet) with a selective kernel unit is used for adaptive feature fusion. The proposed AGCNet can be trained in an end-to-end fashion without manual intervention. The experimental results are reported on four MS and PAN datasets and compared with state-of-the-art methods. The classification maps and accuracies show the superiority of the proposed AGCNet model.
- Research Article
37
- 10.1109/tgrs.2022.3179288
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
With more detailed spatial information being represented in very-high-resolution (VHR) remote sensing images, stringent requirements are imposed on accurate image classification. Because land objects are diverse, with intraclass variation and interclass similarity, efficient and fine classification of VHR images is challenging, especially in complex scenes. Even in some popular deep learning (DL) frameworks, geometric details of land objects may be lost at deep feature levels, so it is difficult to maintain highly detailed spatial information (e.g., edges, small objects) by relying only on the last high-level layer. Moreover, many of the newly developed DL methods require massive numbers of well-labeled samples, which inevitably deteriorates model generalization under few-shot learning. Therefore, in this paper, a lightweight shallow-to-deep feature fusion network (SDF2N) is proposed for VHR image classification, where traditional machine learning (ML) and DL schemes are integrated to learn rich and representative information and improve classification accuracy. In particular, the shallow spectral-spatial features are first extracted, and then a novel triple-stage fusion (TSF) module is designed to learn the salient and discriminative information at different levels for classification. The TSF module includes three feature fusion stages, i.e., low-level spectral-spatial feature fusion, middle-level multi-scale feature fusion, and high-level multi-layer feature fusion. The proposed SDF2N takes advantage of the shallow-to-deep features and can extract representative and complementary information across layers. It is important to note that even with limited training samples, the SDF2N can still achieve satisfactory classification performance.
Experimental results obtained on three real VHR remote sensing data sets including two multispectral and one airborne hyperspectral images covering complex urban scenarios confirm the effectiveness of the proposed approach compared with the state-of-the-art methods.
- Research Article
3
- 10.1038/s41598-024-67121-7
- Jul 10, 2024
- Scientific Reports
Cytomegalovirus retinitis (CMVR) is a significant cause of vision loss. Regular screening is crucial but challenging in resource-limited settings. A convolutional neural network is a state-of-the-art deep learning technique to generate automatic diagnoses from retinal images. However, there are limited numbers of CMVR images to train the model properly. Transfer learning (TL) is a strategy to train a model with a scarce dataset. This study explores the efficacy of TL with different pre-trained weights for automated CMVR classification using retinal images. We utilised a dataset of 955 retinal images (524 CMVR and 431 normal) from Siriraj Hospital, Mahidol University, collected between 2005 and 2015. Images were processed using Kowa VX-10i or VX-20 fundus cameras and augmented for training. We employed DenseNet121 as a backbone model, comparing the performance of TL with weights pre-trained on ImageNet, APTOS2019, and CheXNet datasets. The models were evaluated based on accuracy, loss, and other performance metrics, with the depth of fine-tuning varied across different pre-trained weights. The study found that TL significantly enhances model performance in CMVR classification. The best results were achieved with weights sequentially transferred from ImageNet to APTOS2019 dataset before application to our CMVR dataset. This approach yielded the highest mean accuracy (0.99) and lowest mean loss (0.04), outperforming other methods. The class activation heatmaps provided insights into the model's decision-making process. The model with APTOS2019 pre-trained weights offered the best explanation and highlighted the pathologic lesions resembling human interpretation. Our findings demonstrate the potential of sequential TL in improving the accuracy and efficiency of CMVR diagnosis, particularly in settings with limited data availability. They highlight the importance of domain-specific pre-training in medical image classification. 
This approach streamlines the diagnostic process and paves the way for broader applications in automated medical image analysis, offering a scalable solution for early disease detection.
- Research Article
32
- 10.3390/rs13224621
- Nov 17, 2021
- Remote Sensing
Numerous hyperspectral image (HSI) classification methods based on convolutional neural networks (CNN) have been proposed and achieve promising classification performance. However, hyperspectral image classification still suffers from various challenges, including abundant redundant information, insufficient spectral-spatial representation, irregular class distribution, and so forth. To address these issues, we propose a novel 2D-3D CNN with spectral-spatial multi-scale feature fusion for hyperspectral image classification, which consists of two feature extraction streams, a feature fusion module, and a classification scheme. First, we employ two different backbone modules for feature representation, that is, the spectral feature and the spatial feature extraction streams. The former utilizes a hierarchical feature extraction module to capture multi-scale spectral features, while the latter extracts multi-stage spatial features by introducing a multi-level fusion structure. With these network units, the category attribute information of HSI can be fully exploited. Then, to output more complete and robust information for classification, a multi-scale spectral-spatial-semantic feature fusion module is presented based on a Decomposition-Reconstruction structure. Finally, we introduce a new classification scheme to improve the classification accuracy. Experimental results on three public datasets demonstrate that the proposed method outperforms the state-of-the-art methods.
- Research Article
9
- 10.1016/j.neucom.2016.09.129
- Mar 8, 2017
- Neurocomputing
Exploiting score distribution for heterogenous feature fusion in image classification
- Research Article
1
- 10.1038/s41598-025-18329-8
- Sep 29, 2025
- Scientific Reports
Few-shot classification is a very challenging task in computer vision. Recently, as an alternative to meta-learning, transfer learning that forgoes the episodic training strategy has gradually become popular in this community. Under this pipeline, learning a high-quality feature representation is vital for achieving good performance. However, current works mainly build the classification model upon convolutional neural networks, which cannot extract discriminative features. To address this problem, we propose exploring non-local networks to construct the classification model, which is trained by the joint learning of supervised and self-supervised tasks to obtain global invariant features. Further, we propose a few-shot classification algorithm using multi-stage fusion of local and global features, in which feature fusion happens during both stages of transfer learning. The pre-training stage implements a parallel mechanism, in which the local feature network and global feature network learn from each other, while the few-shot testing stage implements a serial mechanism through feature concatenation. We conducted extensive evaluations on multiple benchmark datasets to demonstrate the effectiveness of our method. Ablation studies have shown the effectiveness of the multi-stage feature fusion, and the comparison results have shown that our method can achieve better performance than other state-of-the-art methods.
- Conference Article
3
- 10.1117/12.2509312
- Jul 30, 2019
To enable further and more accurate automatic analysis and processing of optical coherence tomography (OCT) images, such as layer segmentation, disease region segmentation, registration, etc., it is necessary to screen OCT images first. In this paper, we propose an efficient multi-class 3D retinal OCT image classification network named VinceptionC3D. VinceptionC3D is a 3D convolutional neural network that improves on the basic C3D by adding improved 3D inception modules. Our main contributions are: (1) demonstrating that a fine-tuned C3D pretrained on natural action video datasets can be applied to the classification of 3D retinal OCT images; (2) improving the network by employing a 3D inception module that can capture multi-scale features. The proposed method is trained and tested on 873 3D OCT images with 6 classes. The average accuracies of the C3D with randomly initialized weights, the C3D with pre-trained weights, and the proposed VinceptionC3D with pre-trained weights are 89.35%, 92.09% and 94.04%, respectively. The results show that the proposed VinceptionC3D is effective for 6-class 3D retinal OCT image classification.
- Research Article
9
- 10.2174/1573405620666230328092218
- Jul 11, 2023
- Current Medical Imaging Reviews
Brain tumour detection and classification require trained radiologists for efficient diagnosis. The proposed work aims to build a Computer Aided Diagnosis (CAD) tool to automate brain tumour detection using Machine Learning (ML) and Deep Learning (DL) techniques. Magnetic Resonance Images (MRI) collected from the publicly available Kaggle dataset are used for brain tumour detection and classification. Deep features extracted from the global pooling layer of a pretrained ResNet18 network are classified using 3 different ML classifiers: Support Vector Machine (SVM), K-Nearest Neighbour (KNN), and Decision Tree (DT). The hyperparameters of these classifiers are further optimised using a Bayesian Algorithm (BA) to enhance performance. Fusion of features extracted from shallow and deep layers of the pretrained ResNet18 network, followed by BA-optimised ML classifiers, is further used to enhance detection and classification performance. The confusion matrix derived from the classifier model is used to evaluate the system's performance. Evaluation metrics, such as accuracy, sensitivity, specificity, precision, F1 score, Balance Classification Rate (BCR), Matthews Correlation Coefficient (MCC) and Kappa Coefficient (Kp), are calculated. Maximum accuracy, sensitivity, specificity, precision, F1 score, BCR, MCC, and Kp of 99.11%, 98.99%, 99.22%, 99.09%, 99.09%, 99.10%, 98.21%, and 98.21%, respectively, were obtained for detection using fusion of shallow and deep features of the pretrained ResNet18 network classified by the BA-optimised SVM classifier. Feature fusion also performs better for the classification task, with accuracy, sensitivity, specificity, precision, F1 score, BCR, MCC and Kp of 97.31%, 97.30%, 98.65%, 97.37%, 97.34%, 97.97%, 95.99%, and 93.95%, respectively.
The proposed brain tumour detection and classification framework, which uses deep feature extraction from a pretrained ResNet18 network in conjunction with feature fusion and optimised ML classifiers, can improve system performance. Hence, the proposed work can be used as an assistive tool to aid the radiologist in automated brain tumour analysis and treatment.
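The evaluation metrics listed in this abstract can all be derived from a confusion matrix. The sketch below does this for the binary (detection) case using the standard textbook definitions. It is an illustration, not the authors' code: the function name `binary_metrics` is hypothetical, BCR is assumed to be the arithmetic mean of sensitivity and specificity, and the counts in the usage example are invented.

```python
import math
from typing import Dict


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute common metrics from a 2x2 confusion matrix.

    tp/fn: positives predicted correctly / missed.
    fp/tn: negatives predicted wrongly / correctly.
    """
    total = tp + fn + fp + tn
    accuracy = _safe_div(tp + tn, total)
    sensitivity = _safe_div(tp, tp + fn)  # recall, true positive rate
    specificity = _safe_div(tn, tn + fp)  # true negative rate
    precision = _safe_div(tp, tp + fp)
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    # Assumption: BCR taken as the mean of sensitivity and specificity.
    bcr = (sensitivity + specificity) / 2
    mcc = _safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = _safe_div(
        (tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), total * total
    )
    kappa = _safe_div(accuracy - expected, 1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "bcr": bcr,
        "mcc": mcc,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in binary_metrics(tp=40, fn=10, fp=5, tn=45).items():
        print(f"{name}: {value:.4f}")
```

For a multi-class problem, the same quantities are usually computed per class in a one-vs-rest fashion and then averaged. The abstract does not say which averaging was used.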
- Research Article
11
- 10.3390/curroncol30010042
- Dec 30, 2022
- Current Oncology
Objective: Precise classification of mass-forming intrahepatic cholangiocarcinoma (MF-ICC) and hepatocellular carcinoma (HCC) based on magnetic resonance imaging (MRI) is crucial for personalized treatment strategy. The purpose of the present study was to differentiate MF-ICC from HCC applying a novel deep-learning-based workflow with stronger feature extraction ability and fusion capability to improve the classification performance of deep learning on small datasets. Methods: To retain more effective lesion features, we propose a preprocessing method called semi-segmented preprocessing (Semi-SP) to select the region of interest (ROI). Then, the ROIs were sent to the strided feature fusion residual network (SFFNet) for training and classification. The SFFNet model is composed of three parts: the multilayer feature fusion module (MFF) was proposed to extract discriminative features of MF-ICC/HCC and integrate features of different levels; a new stationary residual block (SRB) was proposed to solve the problem of information loss and network instability during training; the attention mechanism convolutional block attention module (CBAM) was adopted in the middle layer of the network to extract the correlation of multi-spatial feature information, so as to filter the irrelevant feature information in pixels. Results: The SFFNet model achieved an overall accuracy of 92.26% and an AUC of 0.9680, with high sensitivity (86.21%) and specificity (94.70%) for MF-ICC. Conclusion: In this paper, we proposed a specifically designed Semi-SP method and SFFNet model to differentiate MF-ICC from HCC. This workflow achieves good MF-ICC/HCC classification performance due to stronger feature extraction and fusion capabilities, which provide complementary information for personalized treatment strategy.
- Research Article
47
- 10.1080/09540091.2021.1875987
- Jan 22, 2021
- Connection Science
The combination of features from the convolutional layer and the fully connected layer of a convolutional neural network (CNN) provides an effective way to improve the performance of crime scene investigation (CSI) image classification. However, in existing work, the weights used in feature fusion do not change after the training phase, which may produce inaccurate image features and affect classification results. To solve this problem, this paper proposes an adaptive feature fusion method based on an auto-encoder to improve classification accuracy. The method includes the following steps. First, the CNN model is trained by transfer learning. Next, the features of the convolutional layer and the fully connected layer are extracted separately. These extracted features are then passed into the auto-encoder for further learning, with Softmax normalisation used to obtain the adaptive weights for the final classification. Experiments demonstrated that the proposed method achieves higher CSI image classification performance than fixed-weight feature fusion.
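The idea of Softmax-normalised fusion weights can be sketched in a few lines. The abstract does not describe how the auto-encoder produces its scores or how the weighted features are combined, so this sketch makes two explicit assumptions: each feature source receives one scalar score (here supplied by the caller in place of the auto-encoder), and the weighted features are concatenated. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_weighted(
    conv_features: Sequence[float],
    fc_features: Sequence[float],
    scores: Sequence[float],
) -> List[float]:
    """Hypothetical fusion: scale each source by its softmax weight.

    scores holds two scalars, one per feature source. In the paper these
    would come from the learned auto-encoder; here they are plain inputs.
    """
    if len(scores) != 2:
        raise ValueError("expected one score per feature source")
    w_conv, w_fc = softmax(scores)
    # Assumption: weighted features are concatenated, not summed.
    return [w_conv * x for x in conv_features] + [w_fc * x for x in fc_features]


if __name__ == "__main__":
    # Equal scores reproduce fixed, equal-weight fusion.
    print(fuse_weighted([2.0, 4.0], [6.0], scores=[0.0, 0.0]))
    # A higher first score shifts weight toward the convolutional features.
    print(fuse_weighted([2.0, 4.0], [6.0], scores=[2.0, 0.0]))
```

Because the weights depend on the scores, they can differ from image to image, which is the contrast with fixed-weight fusion drawn in the abstract.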
- Research Article
4
- 10.3390/jimaging11040123
- Apr 21, 2025
- Journal of Imaging
Early detection of diabetic retinopathy is critical for preserving vision in diabetic patients. The classification of lesions in retinal fundus images, particularly macular edema, is an essential diagnostic tool, yet it presents a significant learning curve for both novice and experienced ophthalmologists. To address this challenge, a novel Convolutional Deep Belief Network (CDBN) is proposed to classify image patches into three distinct categories: two types of macular edema lesions (microhemorrhages and hard exudates) and a healthy category. The method leverages high-level feature extraction to mitigate issues arising from the high similarity of low-level features in noisy images. Additionally, a Real-Coded Genetic Algorithm optimizes the parameters of the Gabor filters and the network to improve feature extraction and classification performance. Experimental results demonstrate that the proposed CDBN outperforms comparative models, achieving an F1 score of 0.9258. These results indicate that the architecture effectively overcomes the challenges of lesion classification in retinal images, offering a robust tool for clinical application and paving the way for advanced clinical decision support systems in diabetic retinopathy management.
- Book Chapter
9
- 10.1007/978-3-030-40605-9_23
- Jan 1, 2020
Retinal images have become increasingly important in the clinical diagnosis of several eye and systemic diseases. To help medical doctors in this work, automatic and semi-automatic diagnosis methods can be used to increase the efficiency of diagnostic and follow-up processes, as well as to enable wider disease screening programs. However, training advanced machine learning methods for improved retinal image analysis typically requires large and representative retinal image data sets. Even when large data sets of retinal images are available, the occurrence of different medical conditions in them is unbalanced. Hence, there is a need to enrich the existing data sets through data augmentation and the introduction of noise, which is essential to build robust and reliable machine learning models. One way to overcome these shortcomings relies on generative models for synthesizing images. To study the limits of retinal image synthesis, this paper focuses on deep generative models, including a generative adversarial network and a variational autoencoder, to synthesize images from noise without conditioning on any information regarding the retina. The models are trained with the Kaggle EyePACS retinal image set, and to quantify the image quality in a no-reference manner, the generated images are compared with the retinal images of the DiaRetDB1 database using common similarity metrics.
- Conference Article
4
- 10.1109/ncetstea48365.2020.9119931
- Feb 1, 2020
The main objective of the medical image processing field is to design computational tools that assist in the quantification and visualization of notable pathology and anatomical structures. Diabetic retinopathy is a medical disorder in which the retina is damaged by fluid leaking from the blood vessels into the retina of the human eye. The identification of the optic disc in retinal fundus images and the quantitative study of the evolution of its shape and size play an important role in diagnosing different pathologies and abnormalities related to the retina. Most abnormalities related to the optic disc may lead to structural changes in the inner and outer areas of the optic disc. Optic disc identification and segmentation at the level of the whole retinal image reduces the detection sensitivity for those parts. In this research, an advanced classification method based on a hierarchical process for the detection and segmentation of the optic disc is proposed. The exact boundary of the optic disc is obtained by calculating the region of interest and applying an innovative adaptive thresholding based on morphological transformation. The presented technique reduces the area that segmentation techniques need to process, leading to a marked performance improvement and a lower computational cost for each retinal fundus image. The proposed technique has been evaluated on the publicly available retinal image data sets DIARETDB1, DRIVE, HRF, DRIONS-DB, IDRiD and STARE, and a remarkable improvement over existing techniques has been found in terms of accuracy and processing time.