Enhancing digital art style recognition via a hybrid vision transformer and lightweight CNN with attention mechanisms
- Research Article
- 10.3390/buildings15020176
- Jan 9, 2025
- Buildings
The digital recognition and preservation of historical architectural heritage has become a critical challenge in cultural inheritance and sustainable urban development. While deep learning methods show promise in architectural classification, existing models often struggle to achieve ideal results due to the complexity and uniqueness of historical buildings, particularly the limited data availability in remote areas. Focusing on the study of Chinese historical architecture, this research proposes an innovative architectural recognition framework that integrates the Swin Transformer backbone with a custom-designed Global Channel and Spatial Attention (GCSA) mechanism, thereby substantially enhancing the model’s capability to extract architectural details and comprehend global contextual information. Through extensive experiments on a constructed historical building dataset, our model achieves an outstanding performance of over 97.8% in key metrics including accuracy, precision, recall, and F1 score (harmonic mean of the precision and recall), surpassing traditional CNN (convolutional neural network) architectures and contemporary deep learning models. To gain deeper insights into the model’s decision-making process, we employed comprehensive interpretability methods including t-SNE (t-distributed Stochastic Neighbor Embedding), Grad-CAM (gradient-weighted class activation mapping), and multi-layer feature map analysis, revealing the model’s systematic feature extraction process from structural elements to material textures. This study offers substantial technical support for the digital modeling and recognition of architectural heritage in historical buildings, establishing a foundation for heritage damage assessment. It contributes to the formulation of precise restoration strategies and provides a scientific basis for governments and cultural heritage institutions to develop region-specific policies for conservation efforts.
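The abstract does not give the internals of the GCSA module, but the general pattern of combining channel and spatial attention can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: softmax channel weights and a sigmoid spatial gate stand in for the learned layers, and `gcsa_like` is a hypothetical name.

```python
import numpy as np

def channel_attention(x):
    # x: feature map of shape (C, H, W)
    # Squeeze: global average pooling yields one descriptor per channel
    desc = x.mean(axis=(1, 2))
    # Stand-in for the learned excitation: softmax channel weights
    w = np.exp(desc - desc.max())
    w = w / w.sum()
    return x * w[:, None, None]

def spatial_attention(x):
    # Collapse channels to one (H, W) map, squash to (0, 1) with a sigmoid
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=0)))
    return x * gate[None, :, :]

def gcsa_like(x):
    # Channel attention first, then spatial attention (CBAM-style ordering)
    return spatial_attention(channel_attention(x))

x = np.random.rand(8, 16, 16)  # toy feature map
y = gcsa_like(x)
```

Both stages only reweight the input, so the feature map's shape is preserved while informative channels and regions are emphasized.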
- Conference Article
- 10.1117/12.2640778
- Oct 3, 2022
Iris recognition is considered one of the most promising biometrics due to its discriminative features and friendly acquisition methods. Herein, a deep learning-based method is proposed to achieve more accurate and efficient iris recognition. The proposed framework, Iris Attention Network (IrisAttenNet), integrates the attention mechanism into a lightweight CNN to extract iris features more specifically. During feature learning, the more informative channel features that contribute to the recognition result attract more attention and are given higher weights, similar to the human visual perception mechanism. The performance of the proposed framework is evaluated on four publicly available datasets representing different intra-class variations: CASIA_Iris_V4 Interval, Lamp, Thousand, and UBIRIS.v1. The experimental results demonstrate that the approach based on IrisAttenNet achieves higher accuracy, stronger generalization, and lower computational cost. Heat maps of intermediate outputs confirm the key contribution of the attention module through visualization of the attended feature areas of the images.
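The channel-weighting behavior described here follows the general squeeze-and-excitation pattern. A minimal sketch with randomly initialized (untrained) weights might look like this; `se_block`, `w1`, and `w2` are illustrative names, not IrisAttenNet's actual layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_block(x, w1, w2):
    # x: feature map of shape (C, H, W)
    # Squeeze: global average pooling -> one descriptor per channel
    z = x.mean(axis=(1, 2))
    # Excite: bottleneck MLP (C -> C/r -> C), ReLU then sigmoid
    h = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    # Rescale each channel by its learned importance weight
    return x * s[:, None, None]

C, r = 16, 4
w1 = rng.standard_normal((C // r, C))  # reduction layer
w2 = rng.standard_normal((C, C // r))  # expansion layer
x = rng.random((C, 8, 8))
y = se_block(x, w1, w2)
```

In a trained network the sigmoid outputs concentrate near 1 for informative channels and near 0 for uninformative ones, which is the "higher weights" effect the abstract describes.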
- Research Article
- 10.1016/j.compbiomed.2023.106606
- Jan 23, 2023
- Computers in Biology and Medicine
One-stage and lightweight CNN detection approach with attention: Application to WBC detection of microscopic images
- Research Article
- 10.14569/ijacsa.2026.0170142
- Jan 1, 2026
- International Journal of Advanced Computer Science and Applications
Driver drowsiness is a major cause of traffic accidents, so resource-constrained Edge-IoT platforms need to detect drowsy drivers accurately and quickly. This study examines attention-guided lightweight CNN design based on MobileNetV2 for real-time driver drowsiness detection. The authors compare an SE-enhanced MobileNetV2 to the baseline model and to a structurally optimized version that uses Depthwise Separable Convolution (DSC), Bottleneck blocks, and Expansion layers. Experiments on 500 images demonstrate that channel attention enhances feature discrimination, whereas structural optimization yields the most resilient trade-off between accuracy and latency. Statistical validation employing 95% confidence intervals and two-proportion Z-tests substantiates the significance of these enhancements. The proposed models support real-time inference despite their small size (about 2.6 million parameters and 315 million FLOPs). These findings suggest that structural optimization matters more than attention mechanisms when designing lightweight CNNs for embedded driver monitoring.
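The parameter savings from Depthwise Separable Convolution, which make such small models possible, follow from a simple per-layer count. The numbers below are generic illustrations (ignoring bias terms), not the paper's figures:

```python
def conv_params(c_in, c_out, k):
    # standard k×k convolution: every output channel mixes all input channels
    return c_in * c_out * k * k

def dsc_params(c_in, c_out, k):
    # depthwise k×k filter per input channel, then a 1×1 pointwise mix
    return c_in * k * k + c_in * c_out

std = conv_params(32, 64, 3)  # 32 * 64 * 9 = 18432
dsc = dsc_params(32, 64, 3)   # 32 * 9 + 32 * 64 = 2336
ratio = std / dsc             # roughly 8x fewer parameters
```

The same factoring reduces FLOPs proportionally, which is why DSC-based backbones like MobileNetV2 stay in the low-millions parameter range.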
- Preprint Article
- 10.21203/rs.3.rs-4536797/v1
- Jun 20, 2024
The surface characteristics of billets are crucial for subsequent traceability, yet the production process generates intricate digital features on their surfaces. This paper introduces BDR-Net, a novel billet surface digit recognition network. Drawing inspiration from Inception, the network adopts a ResNext-like architecture as its primary framework. It uniformly distributes output across dimensions, extracts positional and scale features separately, and introduces a mixed dilated convolution block to reduce parameters while expanding the receptive field. To address the loss of up-sampled features during fusion, an innovative stream alignment-based up-sampled feature fusion algorithm is proposed. Additionally, to sharpen the network's focus on salient spatial and channel features, a mixed-dimensional attention mechanism (scSE) is integrated into the alignment-based upsampling feature fusion module. Experimental results showcase BDR-Net's outstanding performance: it achieves 95.6% accuracy in classifying billet surface digits, surpassing the ResNext50_32x4d benchmark model by 4.3% in recognition accuracy. Moreover, compared to current classification networks, the model exhibits significant accuracy improvements, and its mAP@0.95 reaches 0.897, surpassing current classification networks. These findings underscore the model's remarkable performance in billet surface digit recognition, offering an effective solution for digit recognition on billet surfaces in steel mills.
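The receptive-field expansion that motivates the mixed dilated convolution block can be computed for a stack of stride-1 convolutions with a standard formula. The dilation rates below are illustrative, not BDR-Net's actual configuration:

```python
def receptive_field(kernel, dilations):
    # each stride-1 layer adds (kernel - 1) * dilation to the receptive field
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

plain = receptive_field(3, [1, 1, 1])  # three ordinary 3x3 convs -> 7
mixed = receptive_field(3, [1, 2, 5])  # same depth, same parameters -> 17
```

Mixing dilation rates more than doubles the receptive field here without adding a single parameter, which is the trade-off the abstract highlights.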
- Research Article
- 10.1109/lgrs.2020.3031593
- Nov 2, 2020
- IEEE Geoscience and Remote Sensing Letters
Synthetic aperture radar automatic target recognition (SAR ATR) is a key technique of remote-sensing image recognition, with many potential applications in military surveillance, national defense, civil applications, and beyond. With the development of science and technology, deep convolutional neural networks (DCNNs) have been widely applied to SAR ATR. However, it is difficult to train deep learning models with limited raw SAR images. To resolve this problem, we propose an effective lightweight attention-mechanism CNN (AM-CNN) model for SAR ATR. Extensive experimental results on the Moving and Stationary Target Acquisition and Recognition (MSTAR) data set illustrate that the AM-CNN model achieves superior recognition performance, with an average recognition accuracy of 99.35% on the classification of 10 target classes. Compared with traditional CNNs and state-of-the-art methods, our model significantly improves both performance and efficiency.
- Conference Article
- 10.1109/ctisc54888.2022.9849794
- Apr 22, 2022
Ancient Chinese characters appear in various historical documents and poetry, and people tend to use optical character recognition tools to understand these uncommon characters. Current Chinese text recognition interfaces are restricted to limited character sets, such as the GB2312-80 and GB18030-2005 standards. However, the newest HanYu Dictionary contains over 55K characters, far more than the commonly used character sets. This work proposes a compact deep network (HYD-CNet) composed of depthwise separable convolutional blocks and a coordinate attention mechanism to recognize ancient Chinese characters. It achieves efficient retrieval and low storage requirements for large-scale character recognition on mobile devices. We build a Chinese character database (HYDDB) from the HanYu Dictionary, containing 55,360 character images, to evaluate model performance. The experiments demonstrate that the proposed HYD-CNet has fewer model parameters at similar accuracy to mainstream lightweight CNNs.
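Coordinate attention differs from plain channel attention by pooling along each spatial axis separately, so positional information survives the squeeze. The pooling step can be sketched as below; `coordinate_pool` is an illustrative name, and the learned transforms and gating that follow in the real module are omitted:

```python
import numpy as np

def coordinate_pool(x):
    # x: feature map of shape (C, H, W)
    h_desc = x.mean(axis=2)  # (C, H): one descriptor per row, keeps vertical position
    w_desc = x.mean(axis=1)  # (C, W): one descriptor per column, keeps horizontal position
    return h_desc, w_desc

x = np.random.rand(4, 8, 6)  # toy feature map
h_desc, w_desc = coordinate_pool(x)
```

Because the two descriptors retain where along each axis a response occurred, the module can localize stroke-level structure, which plausibly matters for dense glyphs like Chinese characters.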
- Research Article
- 10.1109/jsen.2023.3244833
- Apr 1, 2023
- IEEE Sensors Journal
Fire is a disaster that can happen anytime and anywhere. In existing works, sensor- and computer-vision-based approaches have been used to develop fire detection models, but they fail to attain accurate results: sensor-based methods need more time to detect fire locations and their detection coverage is limited, while cameras sometimes mistake strong sunlight for fire, producing false positives that degrade accuracy. To overcome these problems, this research proposes a novel optimized Gaussian probability-based threshold convolutional neural network (GTCNN) model for detecting fire accidents using various sensors and surveillance-camera video (SV). A sensor feature map is calculated from various fire sensors, and frames from the SV are preprocessed using a multiscale retinex algorithm. In addition, a Gaussian threshold (GT) is logically integrated with the feature map to increase the fire pixel count in low-resolution images. The probability results from the sensors and SV camera are optimized by a multiobjective mayfly optimization (MOMO) algorithm that normalizes the network parameters, yielding accurate results. The proposed optimized GTCNN differs from existing deep learning networks in its multifeature processing. The proposed work attains a detection accuracy of 98.23%, improving overall accuracy by 3.25%, 3.79%, and 0.21% over the channel attention mechanism, lightweight CNN, and you-only-look-once (YOLOv5m) models, respectively.
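The multiscale retinex preprocessing mentioned above divides out slowly varying illumination so that true intensity edges (such as flames) stand out. The single-scale core of the idea can be sketched as follows; the paper's multiscale variant averages several such outputs at different sigmas, and this NumPy version is only an illustration:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def single_scale_retinex(img, sigma=2.0):
    # R = log(I) - log(I * G_sigma): reflectance with illumination divided out
    k = gaussian_kernel(sigma, int(3 * sigma))
    # separable Gaussian blur: convolve rows, then columns
    blur = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    blur = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blur)
    eps = 1e-6  # avoid log(0)
    return np.log(img + eps) - np.log(blur + eps)

img = np.full((32, 32), 0.5)  # uniform image: no texture to enhance
out = single_scale_retinex(img)
```

On a uniform image the blurred version equals the original (away from borders), so the retinex response is near zero; real texture or flame edges would produce strong responses.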
- Research Article
- 10.71451/istaer2511
- Mar 5, 2025
- International Scientific Technical and Economic Research
Helicopters are critical aerial platforms, and their operational capability in complex environments is crucial. However, their performance in dark and foggy conditions is limited, particularly in ground target recognition using onboard cameras, due to poor visibility and lighting. To address this issue, we propose a YOLOv8-based model enhanced to improve ground target recognition in dark and foggy environments. The MS block is a multi-scale feature fusion module that enhances generalization by extracting features at different scales. The improved Residual Mobile Block (iRMB) incorporates attention mechanisms to enhance feature representation. SCINet, a spatial-channel attention-based network, adaptively adjusts feature map weights to improve robustness. UnfogNet, a defogging algorithm, enhances image clarity by removing fog. Unlike traditional models, AOD-Net generates clean images via a lightweight CNN, making it easy to integrate into other deep models. Together, these components significantly improve ground target recognition. Our MISU-YOLOv8 model outperforms recent state-of-the-art real-time object detectors, including YOLOv7 and YOLOv8, with fewer parameters and FLOPs, improving YOLOv8's Average Precision (AP) from 37% to over 41%. The approach can also serve as a plug-and-play module for other YOLO models, providing robust technical support for helicopter reconnaissance missions in complex environments.
ACKNOWLEDGEMENTS: Thanks for the data support provided by the National-level Innovation Program Project Fund "Research on Seedling Inspection Robot Technology Based on Multi-source Information Fusion and Deep Network" (No. 202410451009); the Jiangsu Provincial Natural Science Research General Project (No. 20KJB530008); the China Society for Smart Engineering project "Research on Intelligent Internet of Things Devices and Control Program Algorithms Based on Multi-source Data Analysis" (No. ZHGC104432); the China Engineering Management Association project "Comprehensive Application Research on Intelligent Robots and Intelligent Equipment Based on Big Data and Deep Learning" (No. GMZY2174); the Key Project of the National Science and Information Technology Department Research Center, National Science and Technology Development Research Plan (No. KXJS71057); and the Key Project of the National Science and Technology Support Program of the Ministry of Agriculture (No. NYF251050).
- Research Article
- 10.1016/j.bspc.2025.108425
- Jan 1, 2026
- Biomedical Signal Processing and Control
ShallowMRI: A novel lightweight CNN with novel attention mechanism for Multi brain tumor classification in MRI images
- Research Article
- 10.3390/app132212236
- Nov 11, 2023
- Applied Sciences
In recent years, numerous single-image dehazing algorithms have made significant progress; however, dehazing still presents a challenge, particularly in complex real-world scenarios. In fact, single-image dehazing is an inherently ill-posed problem, as scene transmission relies on unknown and nonhomogeneous depth information. This study proposes a novel end-to-end single-image dehazing method called the Integrated Feature Extraction Network (IFE-Net). Instead of estimating the transmission matrix and atmospheric light separately, IFE-Net directly generates the clean image using a lightweight CNN. Because texture details are often lost during dehazing, an attention mechanism module is introduced in IFE-Net to treat different kinds of information impartially. Additionally, a new nonlinear activation function, the bilateral constrained rectified linear unit (BCReLU), is proposed. Extensive experiments were conducted to evaluate the performance of IFE-Net. The results demonstrate that IFE-Net outperforms other single-image haze removal algorithms in terms of both PSNR and SSIM: on the SOTS dataset, IFE-Net achieves a PSNR of 24.63 and an SSIM of 0.905; on the ITS dataset, the PSNR is 25.62 and the SSIM reaches 0.925. The quantitative results on synthesized images are superior or comparable to those of other advanced algorithms, and IFE-Net also exhibits significant subjective visual quality advantages.
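For context, the transmission-and-airlight pipeline that IFE-Net bypasses rests on the standard atmospheric scattering model, I(x) = J(x)·t(x) + A·(1 - t(x)). Inverting it when t and A are known is straightforward; the sketch below illustrates that classical inversion, not IFE-Net's method:

```python
import numpy as np

def dehaze(I, t, A, t_min=0.1):
    # invert I = J*t + A*(1 - t)  =>  J = (I - A) / t + A
    t = np.maximum(t, t_min)  # clamp t to avoid amplifying noise where haze is dense
    return (I - A) / t + A

# round-trip check on synthetic data
rng = np.random.default_rng(1)
J = rng.random((8, 8))                # true scene radiance
t = 0.3 + 0.6 * rng.random((8, 8))    # transmission in [0.3, 0.9]
A = 0.8                               # global atmospheric light
I = J * t + A * (1 - t)               # synthesize the hazy image
J_hat = dehaze(I, t, A)               # recover the scene exactly
```

The ill-posedness the abstract mentions is precisely that t and A are unknown in practice; end-to-end methods like IFE-Net avoid estimating them at all.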
- Conference Article
- 10.1109/incacct65424.2025.11011390
- Apr 17, 2025
Towards Smart Waste Sorting: Lightweight CNN with Attention Mechanisms for Enhanced Classification
- Research Article
- 10.3389/fpls.2025.1672394
- Nov 6, 2025
- Frontiers in Plant Science
Digital image processing and object recognition are fundamental tasks in sensor-driven intelligent systems. This paper proposes a structure-aware artificial intelligence framework tailored for fine-grained recognition of medicinal plant images captured by visual sensors. Compared with recent herbal recognition approaches such as CNNs enhanced with attention mechanisms, cross-modal fusion strategies, and lightweight transformer variants, our method advances the field by jointly integrating graph-based structural modeling, a Bidirectional Semantic Transformer for multi-scale dependency optimization, and a Gradient Optimization Module for gradient-guided refinement. Built upon a Swin-Transformer backbone, the proposed framework effectively enhances semantic discriminability by capturing both spatial and channel-wise dependencies and adaptively reweighting class-discriminative features. To comprehensively validate the framework, we perform experiments on two datasets: (i) the large-scale TCMP-300 benchmark with 52,089 images across 300 categories, where our model achieves 90.32% accuracy, surpassing the Swin-Base baseline by 1.11%; and (ii) a self-constructed herbal dataset containing 1,872 images across 7 classes. Although the latter is relatively small and not intended as a large-scale benchmark, it serves as a challenging evaluation scenario with high intra-class similarity and complex backgrounds, on which our model achieves 92.75% accuracy, improving on the baseline by 1.18%. These results demonstrate that the proposed framework not only advances beyond prior herbal recognition models but also provides robust, sensor-adaptable solutions for practical plant-based applications.
- Conference Article
- 10.1109/icce-asia57006.2022.9954810
- Oct 26, 2022
A Digital Sign Language Recognition based on a 3D-CNN System with an Attention Mechanism
- Research Article
- 10.1155/2022/6742474
- Sep 19, 2022
- Journal of Sensors
Speech recognition plays an important role in human-computer interaction through the use of acoustic sensors, but it is technically difficult: its overall logic is complex, it relies heavily on neural network algorithms, and its technical requirements are extremely high. Feature extraction is the first step in speech recognition, recovering and extracting speech features. Existing methods, such as Mel-frequency cepstral coefficients (MFCCs) and spectrograms, lose a large amount of acoustic information and lack biological interpretability. Likewise, existing speech self-supervised representation learning methods based on contrastive prediction must construct a large number of negative samples during training, and their learning effects depend on large training batches, which demand substantial computational resources. Therefore, in this paper, we propose a new feature extraction method, called SHH (spike-H), that resembles the human brain and achieves higher speech recognition rates than previous methods. The features extracted using the proposed model are subsequently fed into the classification model: a novel parallel CRNN with an attention mechanism that considers both temporal and spatial features. Experimental results show that the proposed CRNN achieves an accuracy of 94.8% on the Aurora dataset. In addition, audio similarity experiments show that SHH can better distinguish audio features, and ablation experiments show that SHH is applicable to digital speech recognition.
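The mel scale underlying the MFCC features criticized here is a simple logarithmic warp of frequency. The standard HTK-style conversion (a well-known formula, not part of this paper) is:

```python
import math

def hz_to_mel(f):
    # perceptual mel scale: roughly linear below 1 kHz, logarithmic above
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # exact inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

m = hz_to_mel(440.0)
f = mel_to_hz(m)  # round-trips back to 440 Hz
```

Averaging spectral energy inside mel-spaced triangular filters is where MFCCs discard fine spectral detail, which is the information loss that motivates alternatives such as the proposed SHH.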