Enhancing digital art style recognition via a hybrid vision transformer and lightweight CNN with attention mechanisms


Similar Papers
  • Research Article
  • 10.3390/buildings15020176
Innovative Framework for Historical Architectural Recognition in China: Integrating Swin Transformer and Global Channel–Spatial Attention Mechanism
  • Jan 9, 2025
  • Buildings
  • Jiade Wu + 3 more

The digital recognition and preservation of historical architectural heritage has become a critical challenge in cultural inheritance and sustainable urban development. While deep learning methods show promise in architectural classification, existing models often struggle to achieve ideal results due to the complexity and uniqueness of historical buildings, particularly the limited data availability in remote areas. Focusing on the study of Chinese historical architecture, this research proposes an innovative architectural recognition framework that integrates the Swin Transformer backbone with a custom-designed Global Channel and Spatial Attention (GCSA) mechanism, thereby substantially enhancing the model’s capability to extract architectural details and comprehend global contextual information. Through extensive experiments on a constructed historical building dataset, our model achieves an outstanding performance of over 97.8% in key metrics including accuracy, precision, recall, and F1 score (harmonic mean of the precision and recall), surpassing traditional CNN (convolutional neural network) architectures and contemporary deep learning models. To gain deeper insights into the model’s decision-making process, we employed comprehensive interpretability methods including t-SNE (t-distributed Stochastic Neighbor Embedding), Grad-CAM (gradient-weighted class activation mapping), and multi-layer feature map analysis, revealing the model’s systematic feature extraction process from structural elements to material textures. This study offers substantial technical support for the digital modeling and recognition of architectural heritage in historical buildings, establishing a foundation for heritage damage assessment. It contributes to the formulation of precise restoration strategies and provides a scientific basis for governments and cultural heritage institutions to develop region-specific policies for conservation efforts.
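The paper's custom GCSA block is not specified in this listing, but the generic pattern such channel-and-spatial attention mechanisms share (a global per-channel gate followed by a per-position spatial gate) can be sketched in a few lines of numpy. This is an illustrative sketch of the general technique, not the paper's exact design:

```python
import numpy as np

def channel_spatial_attention(x):
    """Apply global channel attention, then spatial attention, to a
    feature map x of shape (C, H, W). An illustrative sketch of the
    generic channel-spatial attention pattern; the paper's GCSA
    details are not given here, so the gating is simplified."""
    # Channel attention: global average pool -> sigmoid gate per channel.
    channel_desc = x.mean(axis=(1, 2))                   # shape (C,)
    channel_gate = 1.0 / (1.0 + np.exp(-channel_desc))   # values in (0, 1)
    x = x * channel_gate[:, None, None]
    # Spatial attention: pool across channels -> sigmoid gate per position.
    spatial_desc = x.mean(axis=0)                        # shape (H, W)
    spatial_gate = 1.0 / (1.0 + np.exp(-spatial_desc))
    return x * spatial_gate[None, :, :]

feat = np.random.randn(8, 4, 4)
out = channel_spatial_attention(feat)
```

In a real model the gates would be produced by small learned layers rather than raw pooled activations; the sketch only shows where the channel and spatial reweighting happens.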

  • Conference Article
  • 10.1117/12.2640778
Toward more efficient iris recognition using a lightweight CNN framework with attention mechanism
  • Oct 3, 2022
  • Qinhong Zou + 4 more

Iris recognition is considered one of the most promising biometrics due to its discriminative features and friendly acquisition methods. Herein, a deep learning-based method is proposed to achieve more accurate and efficient iris recognition. The proposed framework, Iris Attention Network (IrisAttenNet), integrates the attention mechanism into a lightweight CNN to extract iris features more specifically. During feature learning, the channel features that carry more information and contribute more to the recognition result attract more attention and are given higher weights, similar to the human visual perception mechanism. The performance of the proposed framework is evaluated on four publicly available datasets representing different intra-class variations: CASIA_Iris_V4 Interval, Lamp, Thousand, and UBIRIS.v1. The experimental results demonstrate that the IrisAttenNet-based approach achieves higher accuracy, stronger generalization, and lower computational cost. Heat maps of intermediate outputs confirm the key contribution of the attention module through visualization of the salient feature areas of the images.
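The channel-weighting mechanism the abstract describes (informative channels receive higher weights) is the squeeze-and-excitation pattern, which can be sketched as follows. The shapes, reduction ratio, and random weights here are illustrative, not IrisAttenNet's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def squeeze_excite(x, w1, w2):
    """SE-style channel attention: squeeze the spatial dims to one
    descriptor per channel, pass it through a small two-layer gate,
    and rescale each channel by its learned importance in (0, 1)."""
    squeezed = x.mean(axis=(1, 2))                 # squeeze: (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)        # excitation FC + ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # per-channel weight
    return x * gate[:, None, None]

C, r = 16, 4                                       # channels, reduction ratio
x = rng.standard_normal((C, 6, 6))
w1 = rng.standard_normal((C // r, C)) * 0.1        # reduction FC weights
w2 = rng.standard_normal((C, C // r)) * 0.1        # expansion FC weights
y = squeeze_excite(x, w1, w2)
```

The reduction ratio `r` trades gate capacity against parameter count, which is why this kind of attention stays cheap enough for lightweight CNNs.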

  • Research Article
  • Cited by 21
  • 10.1016/j.compbiomed.2023.106606
One-stage and lightweight CNN detection approach with attention: Application to WBC detection of microscopic images
  • Jan 23, 2023
  • Computers in Biology and Medicine
  • Zhenggong Han + 7 more


  • Research Article
  • 10.14569/ijacsa.2026.0170142
Attention-Guided Lightweight MobileNetV2 for Real-Time Driver Drowsiness Classification on Edge-IoT Systems
  • Jan 1, 2026
  • International Journal of Advanced Computer Science and Applications
  • Yo Ceng Giap + 6 more

Driver drowsiness is a major cause of traffic accidents, so resource-constrained Edge-IoT platforms must detect drowsy drivers accurately and quickly. This study examines attention-guided lightweight CNN designs based on MobileNetV2 for real-time driver drowsiness detection. The authors compare an SE-enhanced MobileNetV2 against the baseline model and a structurally optimized version that uses Depthwise Separable Convolution (DSC), Bottleneck blocks, and Expansion layers. Experiments on 500 images demonstrate that channel attention enhances feature discrimination, whereas structural optimization yields the most resilient trade-off between accuracy and latency. Statistical validation employing 95% confidence intervals and two-proportion Z-tests substantiates the significance of these enhancements. The proposed models support real-time inference despite their small size (about 2.6 million parameters and 315 million FLOPs). These findings suggest that structural optimization matters more than attention mechanisms when designing lightweight CNNs for embedded driver monitoring.
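The structural optimization the abstract credits, depthwise separable convolution, saves parameters in a way that is easy to verify by arithmetic. The layer sizes below are illustrative, not MobileNetV2's exact configuration:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias omitted):
    every output channel filters all input channels."""
    return c_in * c_out * k * k

def dsc_params(c_in, c_out, k):
    """Depthwise separable conv = one k x k filter per input channel
    (depthwise) plus a 1 x 1 pointwise conv to mix channels."""
    return c_in * k * k + c_in * c_out

# Illustrative layer: 64 -> 128 channels with 3 x 3 kernels.
std = conv_params(64, 128, 3)   # 64 * 128 * 9  = 73,728 weights
dsc = dsc_params(64, 128, 3)    # 64 * 9 + 8192 =  8,768 weights
ratio = std / dsc               # roughly 8.4x fewer parameters
```

The same factoring also cuts FLOPs by roughly the same ratio, which is what makes the optimized model fit the stated 2.6 M-parameter budget.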

  • Preprint Article
  • 10.21203/rs.3.rs-4536797/v1
BDR-Net: Digital Recognition Network for Billet Surface Based on Flow Alignment and Attention Mechanism
  • Jun 20, 2024
  • Jinyu Xu + 2 more

The surface characteristics of billets are crucial for subsequent traceability, yet the production process generates intricate digital features on their surfaces. This paper introduces BDR-Net, a novel billet surface digit recognition network. Drawing inspiration from Inception, the network adopts a ResNext-like architecture as its primary framework. It uniformly distributes output across dimensions, extracts positional and scale features separately, and introduces a mixed dilated convolution block to reduce parameters while expanding the receptive field. To address the challenge of up-sampled features being lost during fusion, an innovative flow alignment-based up-sampled feature fusion algorithm is proposed. Additionally, to sharpen the network's focus on salient spatial and channel features, a mixed-dimensional attention mechanism (scSE) is integrated into the alignment-based upsampling feature fusion module. Experimental results showcase BDR-Net's outstanding performance: it achieves 95.6% accuracy in classifying billet surface digits, surpassing the ResNext50_32x4d benchmark model by 4.3% in recognition accuracy. Moreover, compared with current classification networks, the model exhibits significant accuracy improvements. Furthermore, its mAP@0.95 reaches 0.897, again surpassing current classification networks. These findings underscore the remarkable performance of the model in billet surface digit recognition, offering an effective solution for digit recognition on billet surfaces in steel mills.
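The receptive-field gain from mixing dilation rates, the motivation for the mixed dilated convolution block, follows from a standard formula. The dilation rates below are illustrative; the paper's actual rates are not given in this listing:

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions, where each
    layer is (kernel_size, dilation): rf = 1 + sum((k - 1) * d).
    Dilation inserts gaps between kernel taps, widening coverage
    without adding weights."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# Three plain 3x3 convs vs. a mixed-dilation stack (illustrative rates).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])   # rf = 7
mixed = receptive_field([(3, 1), (3, 2), (3, 3)])   # rf = 13
```

Both stacks have identical parameter counts, which is exactly the trade the abstract describes: a wider view of the billet surface at no extra cost.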

  • Research Article
  • Cited by 24
  • 10.1109/lgrs.2020.3031593
Convolutional Neural Network With Attention Mechanism for SAR Automatic Target Recognition
  • Nov 2, 2020
  • IEEE Geoscience and Remote Sensing Letters
  • Ming Zhang + 5 more

Synthetic aperture radar automatic target recognition (SAR ATR) is a key technique of remote-sensing image recognition, with many potential applications in military surveillance, national defense, civil applications, and so on. With the development of science and technology, deep convolutional neural networks (DCNNs) have been widely applied to SAR ATR. However, it is difficult to train deep learning models with limited raw SAR images. To resolve this problem, we propose an effective lightweight attention mechanism CNN (AM-CNN) model for SAR ATR. Extensive experimental results on the Moving and Stationary Target Acquisition and Recognition (MSTAR) data set illustrate that the AM-CNN model achieves superior recognition performance, with an average recognition accuracy of 99.35% on the classification of 10 target classes. Compared with the traditional CNN and the state-of-the-art method, our model significantly improves performance and efficiency.

  • Conference Article
  • Cited by 1
  • 10.1109/ctisc54888.2022.9849794
A Lightweight CNN for Large-scale Chinese Character Recognition
  • Apr 22, 2022
  • Junwei Zhou + 2 more

Ancient Chinese characters appear in various historical documents and poetry. People tend to use optical character recognition tools to understand these uncommon characters. Current Chinese text recognition interfaces are restricted to limited character sets, such as the GB2312-80 and GB18030-2005 standards. However, the newest HanYu Dictionary contains over 55K characters, far more than the commonly used character sets. This work proposes a compact deep network (HYD-CNet) composed of depthwise separable convolutional blocks and a coordinate attention mechanism to recognize ancient Chinese characters. It achieves efficient retrieval and low storage requirements for large-scale character recognition on mobile devices. We build a Chinese character database (HYDDB) from the HanYu Dictionary, containing 55,360 character images, to evaluate model performance. The experiments demonstrate that the proposed HYD-CNet has fewer model parameters at accuracy similar to mainstream lightweight CNNs.
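Coordinate attention, which HYD-CNet uses, differs from plain channel attention by pooling along each spatial axis separately so the gate keeps positional information. A simplified numpy sketch of that idea (the shared learned transform of the original design is omitted for brevity):

```python
import numpy as np

def coordinate_attention(x):
    """Simplified coordinate attention on x of shape (C, H, W):
    pool over width and over height separately, gate with sigmoids,
    and combine, so attention varies by row and column rather than
    collapsing all spatial information into one scalar per channel."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    h_desc = x.mean(axis=2)   # (C, H): pooled over width
    w_desc = x.mean(axis=1)   # (C, W): pooled over height
    gate = sigmoid(h_desc)[:, :, None] * sigmoid(w_desc)[:, None, :]
    return x * gate

feat = np.random.randn(4, 5, 6)
out = coordinate_attention(feat)
```

Retaining row/column position is useful for characters, where strokes at particular locations distinguish otherwise similar glyphs.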

  • Research Article
  • Cited by 37
  • 10.1109/jsen.2023.3244833
Fire Sensor and Surveillance Camera-Based GTCNN for Fire Detection System
  • Apr 1, 2023
  • IEEE Sensors Journal
  • P Sridhar + 3 more

Fire accidents are disasters that can happen anytime and anywhere. In existing work, sensor- and computer vision-based approaches have been used to develop fire detection models, but they fail to attain accurate results: sensor-based methods need more time to locate fires and offer limited detection coverage, while cameras sometimes mistake strong sunlight for fire, producing false positives that degrade accuracy. To overcome these problems, this research proposes a novel optimized Gaussian probability-based threshold convolutional neural network (GTCNN) model for detecting fire accidents using various sensors and surveillance camera-based video (SV). A sensor feature map is calculated from the various fire sensors, and frames from the SV are preprocessed using a multiscale retinex algorithm. In addition, a Gaussian threshold (GT) is logically integrated with the feature map to increase the fire pixel count in low-resolution images. The probability results from the sensors and SV camera are optimized by a multiobjective mayfly optimization (MOMO) algorithm that normalizes the network parameters, yielding accurate results. The proposed optimized GTCNN differs from existing deep learning networks in its multifeature processing. It attains a detection accuracy of 98.23%, improving overall accuracy by 3.25%, 3.79%, and 0.21% over the channel attention mechanism, lightweight CNN, and you-only-look-once (YOLOv5m) baselines, respectively.

  • Research Article
  • 10.71451/istaer2511
Methods for Ground Target Recognition from an Aerial Camera on a Helicopter Using the MISU-YOLOv8 Model in Dark and Foggy Environments
  • Mar 5, 2025
  • International Scientific Technical and Economic Research
  • Houbin Wang + 5 more

Helicopters are critical aerial platforms, and their operational capability in complex environments is crucial. However, their performance in dark and foggy conditions is limited, particularly in ground target recognition using onboard cameras, due to poor visibility and lighting. To address this issue, we propose a YOLOv8-based model enhanced to improve ground target recognition in dark and foggy environments. The MS block is a multi-scale feature fusion module that enhances generalization by extracting features at different scales. The improved Residual Mobile Block (iRMB) incorporates attention mechanisms to enhance feature representation. SCINet, a spatial-channel attention-based network, adaptively adjusts feature map weights to improve robustness. UnfogNet, a defogging algorithm, enhances image clarity by removing fog. This integrated approach significantly improves ground target recognition capabilities. Unlike traditional models, AOD-Net generates clean images via a lightweight CNN, making it easily integrable into other deep models. Our MISU-YOLOv8 model outperforms recent state-of-the-art real-time object detectors, including YOLOv7 and YOLOv8, with fewer parameters and FLOPs, improving YOLOv8's Average Precision (AP) from 37% to over 41%. This work can also serve as a plug-and-play module for other YOLO models, and this advancement provides robust technical support for helicopter reconnaissance missions in complex environments.

Acknowledgements: Thanks for the data support provided by National-level Innovation Program Project Fund "Research on Seedling Inspection Robot Technology Based on Multi-source Information Fusion and Deep Network" (No.: 202410451009); Jiangsu Provincial Natural Science Research General Project (No.: 20KJB530008); China Society for Smart Engineering "Research on Intelligent Internet of Things Devices and Control Program Algorithms Based on Multi-source Data Analysis" (No.: ZHGC104432); China Engineering Management Association "Comprehensive Application Research on Intelligent Robots and Intelligent Equipment Based on Big Data and Deep Learning" (No.: GMZY2174); Key Project of National Science and Information Technology Department Research Center National Science and Technology Development Research Plan (No.: KXJS71057); Key Project of National Science and Technology Support Program of Ministry of Agriculture (No.: NYF251050).

  • Research Article
  • Cited by 2
  • 10.1016/j.bspc.2025.108425
ShallowMRI: A novel lightweight CNN with novel attention mechanism for Multi brain tumor classification in MRI images
  • Jan 1, 2026
  • Biomedical Signal Processing and Control
  • Saif Ur Rehman Khan + 5 more


  • Research Article
  • Cited by 1
  • 10.3390/app132212236
IFE-Net: An Integrated Feature Extraction Network for Single-Image Dehazing
  • Nov 11, 2023
  • Applied Sciences
  • Can Leng + 1 more

In recent years, numerous single-image dehazing algorithms have made significant progress; however, dehazing still presents a challenge, particularly in complex real-world scenarios. In fact, single-image dehazing is an inherently ill-posed problem, as scene transmission relies on unknown and nonhomogeneous depth information. This study proposes a novel end-to-end single-image dehazing method called the Integrated Feature Extraction Network (IFE-Net). Instead of estimating the transmission matrix and atmospheric light separately, IFE-Net directly generates the clean image using a lightweight CNN. Because texture details are often lost during dehazing, an attention mechanism module is introduced in IFE-Net to handle different information impartially. Additionally, a new nonlinear activation function is proposed in IFE-Net, known as the bilateral constrained rectified linear unit (BCReLU). Extensive experiments were conducted to evaluate the performance of IFE-Net. The results demonstrate that IFE-Net outperforms other single-image haze removal algorithms in terms of both PSNR and SSIM. On the SOTS dataset, IFE-Net achieves a PSNR of 24.63 and an SSIM of 0.905; on the ITS dataset, the PSNR is 25.62 and the SSIM reaches 0.925. The quantitative results on synthesized images are either superior to or comparable with those obtained via other advanced algorithms. Moreover, IFE-Net also exhibits significant subjective visual quality advantages.
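The abstract names BCReLU but does not define it; one plausible reading of "bilateral constrained" is an activation clipped on both sides, in the spirit of ReLU6 but with configurable bounds. This sketch is an assumption about the shape of such a function, not the paper's exact formulation:

```python
import numpy as np

def bcrelu(x, lower=0.0, upper=6.0):
    """Hypothetical bilateral-constrained ReLU: zero below `lower`,
    identity in between, saturating at `upper`. Both bounds here are
    illustrative defaults, not values from the paper."""
    return np.clip(x, lower, upper)

vals = np.array([-3.0, 0.5, 7.5])
out = bcrelu(vals)
```

Bounding the activation on both sides keeps intermediate values in a fixed range, which can stabilize training and suits low-precision inference.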

  • Conference Article
  • 10.1109/incacct65424.2025.11011390
Towards Smart Waste Sorting: Lightweight CNN with Attention Mechanisms for Enhanced Classification
  • Apr 17, 2025
  • Rashi Chauhan + 3 more


  • Research Article
  • 10.3389/fpls.2025.1672394
AI-powered recognition of Chinese medicinal herbs with semantic structure modeling and gradient-guided enhancement
  • Nov 6, 2025
  • Frontiers in Plant Science
  • Jiaxing Zou + 1 more

Digital image processing and object recognition are fundamental tasks in sensor-driven intelligent systems. This paper proposes a structure-aware artificial intelligence framework tailored for fine-grained recognition of medicinal plant images captured by visual sensors. Compared with recent herbal recognition approaches such as CNNs enhanced with attention mechanisms, cross-modal fusion strategies, and lightweight transformer variants, our method advances the field by jointly integrating graph-based structural modeling, a Bidirectional Semantic Transformer for multi-scale dependency optimization, and a Gradient Optimization Module for gradient-guided refinement. Built upon a Swin-Transformer backbone, the proposed framework effectively enhances semantic discriminability by capturing both spatial and channel-wise dependencies and adaptively reweighting class-discriminative features. To comprehensively validate the framework, we perform experiments on two datasets: (i) the large-scale TCMP-300 benchmark with 52,089 images across 300 categories, where our model achieves 90.32% accuracy, surpassing the Swin-Base baseline by 1.11%; and (ii) a self-constructed herbal dataset containing 1,872 images across 7 classes. Although the latter is relatively small and not intended as a large-scale benchmark, it serves as a challenging evaluation scenario with high intra-class similarity and complex backgrounds, on which our model achieves 92.75% accuracy, improving by 1.18%. These results demonstrate that the proposed framework not only advances beyond prior herbal recognition models but also provides robust and sensor-adaptable solutions for practical plant-based applications.

  • Conference Article
  • Cited by 2
  • 10.1109/icce-asia57006.2022.9954810
A Digital Sign Language Recognition based on a 3D-CNN System with an Attention Mechanism
  • Oct 26, 2022
  • Ying Ma + 2 more


  • Research Article
  • Cited by 2
  • 10.1155/2022/6742474
Use Brain-Like Audio Features to Improve Speech Recognition Performance
  • Sep 19, 2022
  • Journal of Sensors
  • Junyi Wang + 2 more

Speech recognition plays an important role in human-computer interaction through the use of acoustic sensors, but it is technically difficult: the overall logic is complex, it relies heavily on neural network algorithms, and the technical requirements are extremely high. Feature extraction is the first step in speech recognition, recovering and extracting speech features. Existing methods, such as Mel-frequency cepstral coefficients (MFCC) and spectrograms, lose a large amount of acoustic information and lack biological interpretability. Likewise, existing self-supervised speech representation learning methods based on contrastive prediction must construct a large number of negative samples during training, and their learning effect depends on large training batches, which demands substantial computational resources. Therefore, in this paper, we propose a new feature extraction method, called SHH (spike-H), that resembles the human brain and achieves higher speech recognition rates than previous methods. The features extracted using the proposed model are subsequently fed into the classification model: a novel parallel CRNN with an attention mechanism that considers both temporal and spatial features. Experimental results show that the proposed CRNN achieves an accuracy of 94.8% on the Aurora dataset. In addition, audio similarity experiments show that SHH better distinguishes audio features, and ablation experiments show that SHH is applicable to digital speech recognition.
