An intelligent SCNNBN-TBiG hybrid model for casting defect classification

Abstract

Casting is a fundamental manufacturing process for producing components with complex geometries; however, surface and subsurface defects continue to compromise product reliability and production efficiency. To support automated and consistent quality inspection, this paper presents a hybrid deep learning framework termed SCNNBN-TBiG for intelligent casting defect classification. The proposed approach integrates stacked convolutional neural networks with batch normalisation to extract stable and discriminative spatial features, followed by a Transformer encoder that captures long-range contextual relationships through multi-head self-attention. The resulting representations are compressed using global average pooling and subsequently analysed by stacked bidirectional gated recurrent unit layers to model sequential dependencies within the learned feature space. The framework is evaluated on a publicly available industrial casting image dataset comprising 7,348 samples across defective and non-defective categories. Experimental results demonstrate that the proposed model achieves a testing accuracy of 99.44%, outperforming several existing deep learning and hybrid architectures. The findings confirm that the synergistic integration of spatial, global, and sequential feature learning provides a robust and efficient solution for high-precision industrial quality inspection.
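The Transformer stage rests on scaled dot-product self-attention. Below is a minimal single-head sketch in plain Python, assuming (as an illustration, not the authors' stated design) that each spatial position of the CNN feature map is treated as one token. The projection matrices `Wq`, `Wk`, `Wv` are hypothetical placeholders for learned weights. A multi-head layer runs several such heads with separate projections and concatenates their outputs.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a list of token vectors."""
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of ALL value vectors, so every position
        # can draw on every other position (long-range context).
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return outputs, weights

# Hypothetical example: 4 spatial tokens with 3 channels, identity projections.
tokens = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.2, 0.3, 0.9]]
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out, attn = self_attention(tokens, I, I, I)
```

Every row of `attn` is a probability distribution over all tokens, which is what lets the encoder relate distant regions of a casting image in a single layer.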

Similar Papers
  • Conference Article
  • Cited by 6
  • 10.1145/3394171.3413730
Discriminative Spatial Feature Learning for Person Re-Identification
  • Oct 12, 2020
  • Peixi Peng + 4 more

Person re-identification (ReID) aims to match detected pedestrian images across multiple non-overlapping cameras. Most existing methods employ a backbone CNN to extract a vectorized feature representation by applying a global pooling operation (such as global average pooling or global max pooling) to the 3D feature map (i.e., the output of the backbone CNN). Although simple and effective in some situations, global pooling captures only the statistical properties of the feature map and ignores its spatial distribution. Hence, it cannot distinguish two feature maps that have similar response values located at entirely different positions. To handle this challenge, a novel method is proposed to learn discriminative spatial features. First, a self-constrained spatial transformer network (SC-STN) is introduced to handle misalignments caused by detection errors. Then, based on the prior knowledge that the spatial structure of a pedestrian usually remains robust along the vertical orientation of the image, a novel vertical convolution network (VCN) is proposed to extract spatial features along the vertical direction. Extensive experimental evaluations on several benchmarks demonstrate that the proposed method achieves state-of-the-art performance while adding only a few parameters to the backbone.
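The argument about global pooling can be checked directly. The sketch below (plain Python, hypothetical toy feature maps) pools two maps whose strong responses sit at the top versus the bottom: GAP and GMP return identical vectors, while a simple row-wise pooling that keeps the vertical axis tells them apart. The row-wise pooling is only an illustration of why vertical structure is informative; it is not the paper's VCN.

```python
def global_avg_pool(fmap):
    """Average each channel of a C x H x W feature map (nested lists) to one value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def global_max_pool(fmap):
    """Take the maximum of each channel of a C x H x W feature map."""
    return [max(max(row) for row in ch) for ch in fmap]

def row_profile(fmap):
    """Average over width only, keeping one value per row (vertical position)."""
    return [[sum(row) / len(row) for row in ch] for ch in fmap]

# Two single-channel 3x3 maps with the same responses at different heights.
top = [[[9.0, 9.0, 9.0],
        [1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0]]]
bottom = [[[0.0, 0.0, 0.0],
           [1.0, 1.0, 1.0],
           [9.0, 9.0, 9.0]]]

print(global_avg_pool(top) == global_avg_pool(bottom))  # True: GAP cannot tell them apart
print(global_max_pool(top) == global_max_pool(bottom))  # True: neither can GMP
print(row_profile(top), row_profile(bottom))            # [[9.0, 1.0, 0.0]] [[0.0, 1.0, 9.0]]
```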

  • Research Article
  • Cited by 74
  • 10.1016/j.ecolmodel.2022.110166
Rice plant disease classification using dilated convolutional neural network with global average pooling
  • Oct 8, 2022
  • Ecological Modelling
  • S Senthil Pandi + 5 more

  • Research Article
  • 10.3390/math13040660
EHAFF-NET: Enhanced Hybrid Attention and Feature Fusion for Pedestrian ReID
  • Feb 17, 2025
  • Mathematics
  • Jun Yang + 5 more

This study addresses the cross-scenario challenges in pedestrian re-identification for public safety, including perspective differences, lighting variations, occlusions, and vague feature expressions. We propose a pedestrian re-identification method called EHAFF-NET, which integrates an enhanced hybrid attention mechanism and multi-branch feature fusion. We introduce the Enhanced Hybrid Attention Module (EHAM), which combines channel and spatial attention mechanisms. The channel attention mechanism uses self-attention to capture long-range dependencies and extracts multi-scale local features with convolutional kernels and channel shuffling. The spatial attention mechanism aggregates features using global average and max pooling to enhance spatial representation. To tackle issues such as perspective differences, lighting changes, and occlusions, we incorporate the Multi-Branch Feature Integration module. The global branch captures overall information with global average pooling, while the local branch integrates features from different layers via the Diverse-Depth Feature Integration Module (DDFIM) to extract multi-scale semantic information. It also extracts features based on human body proportions, balancing high-level semantics and low-level details. Experiments show that our model achieves a mAP of 92.5% and R1 of 94.7% on the Market-1501 dataset, a mAP of 85.4% and R1 of 88.6% on the DukeMTMC-reID dataset, and a mAP of 49.1% and R1 of 73.8% on the MSMT17 dataset, demonstrating significant accuracy advantages over several advanced models.
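A minimal sketch of the spatial attention step, assuming a CBAM-style design: average and max pooling across channels at each location, combined into a sigmoid mask that re-weights every channel. The learned convolution usually applied to the pooled maps is replaced here by a hypothetical per-location linear mix (`w_avg`, `w_max`, `bias`), so this is not the EHAM module itself.

```python
import math

def spatial_attention(fmap, w_avg=1.0, w_max=1.0, bias=0.0):
    """Re-weight a C x H x W map by a sigmoid spatial mask built from
    channel-wise average and max pooling.

    The linear mix (w_avg, w_max, bias) is a hypothetical stand-in for the
    learned convolution used in practice.
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    mask = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            column = [fmap[c][i][j] for c in range(C)]  # all channels at (i, j)
            avg = sum(column) / C
            mx = max(column)
            mask[i][j] = 1.0 / (1.0 + math.exp(-(w_avg * avg + w_max * mx + bias)))
    # Broadcast the same H x W mask across every channel.
    out = [[[fmap[c][i][j] * mask[i][j] for j in range(W)] for i in range(H)]
           for c in range(C)]
    return out, mask

# Hypothetical 2-channel 2x2 map: location (0, 1) responds strongly in both channels.
fmap = [
    [[0.1, 2.0], [0.0, 0.3]],
    [[0.2, 1.5], [0.1, 0.0]],
]
out, mask = spatial_attention(fmap, bias=-1.0)
```

Locations where the channels respond strongly receive mask values near 1 and are preserved; weak locations are attenuated.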

  • Research Article
  • Cited by 32
  • 10.1093/dmfr/twad003
Comparison of deep learning methods for the radiographic detection of patients with different periodontitis stages.
  • Dec 13, 2023
  • Dentomaxillofacial Radiology
  • Berceste Guler Ayyildiz + 3 more

The objective of this study is to assess the accuracy of computer-assisted staging of periodontal bone loss using deep learning (DL) methods on panoramic radiographs and to compare the performance of various models and layers. Panoramic radiographs were diagnosed and classified into 3 groups, namely "healthy," "Stage1/2," and "Stage3/4," and stored in separate folders. The feature extraction stage involved transferring and retraining the feature extraction layers and weights of 3 models originally proposed for classifying the ImageNet dataset, namely ResNet50, DenseNet121, and InceptionV3, to 3 DL models designed for classifying periodontal bone loss. The features obtained from the global average pooling (GAP), global max pooling (GMP), or flatten (FL) layers of the convolutional neural network (CNN) models were used as input to 8 different machine learning (ML) models. In addition, the features obtained from the GAP, GMP, or FL layers of the DL models were reduced using the minimum redundancy maximum relevance (mRMR) method and then classified again with the 8 ML models. A total of 2533 panoramic radiographs, including 721 in the healthy group, 842 in the Stage1/2 group, and 970 in the Stage3/4 group, were included in the dataset. The average performance of the DenseNet121 + GAP-based and DenseNet121 + GAP + mRMR-based ML techniques across 10 subdatasets, and of the ML models developed using the 2 feature selection techniques, exceeded that of the CNN models. The new DenseNet121 + GAP + mRMR-based support vector machine model developed in this study achieved higher performance in periodontal bone loss classification than other models in the literature by detecting effective features from raw images without the need for manual selection.
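The choice of GAP, GMP or FL mainly changes the length of the feature vector passed to the ML models, which in turn affects how much work mRMR selection has to do. The arithmetic below uses DenseNet121's final feature map for a standard 224×224 ImageNet-sized input (1024 channels at 7×7); the radiograph resolution used in the study is not given here, so treat the numbers as illustrative.

```python
def feature_lengths(channels, height, width):
    """Length of the feature vector produced by each readout of a C x H x W map."""
    return {
        "GAP": channels,                  # one average per channel
        "GMP": channels,                  # one maximum per channel
        "FL": channels * height * width,  # every activation kept
    }

# DenseNet121 final feature map for a 224 x 224 input: 1024 x 7 x 7.
print(feature_lengths(1024, 7, 7))
# {'GAP': 1024, 'GMP': 1024, 'FL': 50176}
```

The flatten readout is roughly 49 times longer than the pooled ones, which is one reason a selection step such as mRMR matters most for FL features.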

  • Conference Article
  • Cited by 4
  • 10.21437/interspeech.2020-2791
A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling
  • Oct 25, 2020
  • Chieh-Chi Kao + 3 more

This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set, and 6.0% on the eval set in terms of absolute values. For weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set, and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data.
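The reason a GAP layer can yield frame-level labels is linearity: averaging the features over time and then applying a linear classifier gives the same clip score as classifying every frame and averaging the scores. A model trained only on clip labels can therefore output per-frame scores by skipping the pooling step, the idea behind Zhou et al.'s class activation mapping. The sketch assumes a plain linear classifier on hypothetical toy features; the paper's DenseNet extractor and any output nonlinearity are not reproduced.

```python
def frame_scores(features, weights):
    """Apply a linear classifier to every frame.

    features: T x C list (one C-dim feature vector per time frame)
    weights:  K x C list (one weight row per event class)
    returns:  T x K frame-level class scores
    """
    return [[sum(w * f for w, f in zip(row, frame)) for row in weights]
            for frame in features]

def clip_scores_gap_first(features, weights):
    """Average-pool over time first, then classify (the clip-level path)."""
    T = len(features)
    pooled = [sum(frame[c] for frame in features) / T for c in range(len(features[0]))]
    return [sum(w * p for w, p in zip(row, pooled)) for row in weights]

features = [[0.0, 1.0], [2.0, 0.5], [4.0, 0.0]]  # 3 frames, 2 channels (toy values)
weights = [[1.0, 0.0], [0.0, 2.0]]                # 2 event classes (toy values)

per_frame = frame_scores(features, weights)       # frame-level predictions
clip_a = clip_scores_gap_first(features, weights)
clip_b = [sum(f[k] for f in per_frame) / len(per_frame) for k in range(len(weights))]
# clip_a and clip_b agree: pooling and a linear layer commute.
```

If a sigmoid or softmax is applied after the linear layer, the equality holds for the pre-activation scores, not for the probabilities.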

  • Research Article
  • Cited by 33
  • 10.1080/21681163.2022.2060864
Deep hybrid architectures for diabetic retinopathy classification
  • Apr 18, 2022
  • Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization
  • Chaymaa Lahmar + 1 more

Diabetic retinopathy (DR) is the most severe ocular complication of diabetes. It leads to serious eye complications such as vision impairment and blindness. Computer-aided diagnosis may help in the early detection of this disease, which increases the chances of treating it efficiently. This paper empirically evaluates 28 deep hybrid architectures for automatic binary classification of referable diabetic retinopathy and compares them with seven end-to-end deep learning (DL) architectures. For the hybrid architectures, we combined seven DL techniques for feature extraction (DenseNet201, VGG16, VGG19, MobileNet_V2, Inception_V3, Inception_ResNet_V2 and ResNet50) and four classifiers (SVM, MLP, DT and KNN). For the end-to-end DL architectures, we used the same techniques used for feature extraction in the hybrid architectures. The architectures were compared in terms of accuracy, sensitivity, precision and F1-score using the Scott Knott test and the Borda count voting method. All empirical evaluations were performed on three datasets, APTOS, Kaggle DR and Messidor-2, using k-fold cross-validation. The results showed the potential of combining deep learning techniques for feature extraction with classical machine learning techniques to classify referable diabetic retinopathy. The hybrid architecture combining MobileNet_V2 feature extraction with an SVM classifier was the top-performing hybrid and was ranked, together with the best-performing end-to-end architectures, in the best cluster on the APTOS, Kaggle DR and Messidor-2 datasets, with accuracies of 88.80%, 84.01% and 84.05%, respectively. Note that the two end-to-end architectures DenseNet201 and MobileNet_V2 outperformed all the hybrid architectures on the three datasets.
However, we recommend the hybrid architecture built with SVM and MobileNet_V2, since it is promising, less time-consuming, and requires less parameter tuning than the end-to-end techniques.
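A hybrid architecture of this kind splits the pipeline in two: a pretrained CNN turns each image into a feature vector, and a classical classifier labels that vector. The sketch uses k-nearest neighbours (KNN, one of the four classifiers listed) because it fits in a few lines of plain Python; the best hybrid in the study used an SVM instead. The two-dimensional "deep features" are hypothetical stand-ins for real MobileNet_V2 outputs.

```python
import math
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a deep-feature vector by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    dists = sorted(
        (math.dist(f, query), label) for f, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical pooled CNN features for six fundus images.
train = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.9, 0.8], [1.0, 0.9], [0.8, 1.0]]
labels = ["non-referable"] * 3 + ["referable"] * 3

print(knn_predict(train, labels, [0.85, 0.9]))  # referable
```

Only the classifier is trained on the target data here; the feature extractor stays frozen, which is why such hybrids need less tuning than end-to-end fine-tuning.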

  • Research Article
  • 10.1007/s10278-025-01738-6
Ensembling Vision Transformers and ResNet-50 for Interpretable Lung Cancer Diagnosis with Feature Fusion and XAI Techniques.
  • Nov 13, 2025
  • Journal of Imaging Informatics in Medicine
  • Rahul + 3 more

Lung cancer remains a leading cause of cancer-related mortality, primarily due to diagnostic inconsistencies and limitations of conventional methods. This study addresses the critical need for accurate, transparent, and clinically viable diagnostic systems by proposing a novel deep learning framework for histopathological lung cancer classification. Our research introduces a hybrid ensemble architecture that combines the hierarchical feature extraction capabilities of ResNet-50 with the global contextual understanding of Vision Transformer (ViT). Input images are processed in parallel through both pathways: ResNet-50 extracts 2048-dimensional spatial features via convolutional and residual blocks followed by global average pooling, while ViT generates 768-dimensional features from patch embeddings and a transformer encoder. These features are then fused into a 2816-dimensional combined vector, which is fed into a classification head comprising three fully connected layers with Batch Normalization, ReLU activation, and Dropout regularization, culminating in a 3-class softmax output. The ensemble model demonstrated superior performance, achieving a mean cross-validation accuracy of 99.96% ± 0.0004%, a holdout test set accuracy of 99.94%, and a separate test set accuracy of 99.82%. Furthermore, the integration of a multi-disciplinary Explainable AI (XAI) strategy, including Grad-CAM, LIME, SHAP, Saliency Maps, Integrated Gradients, and Occlusion Sensitivity, provided crucial interpretability, with attention heatmaps showing 87.3% overlap with pathologist-identified regions of interest. This work significantly advances AI-assisted lung cancer diagnosis by offering a robust, highly accurate, and interpretable solution that addresses the current clinical gaps and holds huge potential for improving patient outcomes.
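The fusion step is a concatenation, so the dimensions simply add: 2048 + 768 = 2816. A minimal sketch, with zero vectors standing in for real backbone outputs and a softmax over three hypothetical logits in place of the paper's three-layer classification head:

```python
import math

RESNET_DIM = 2048  # ResNet-50 features after global average pooling
VIT_DIM = 768      # ViT features from the transformer encoder

def fuse(resnet_feat, vit_feat):
    """Concatenate the two backbone outputs into one combined vector."""
    assert len(resnet_feat) == RESNET_DIM and len(vit_feat) == VIT_DIM
    return resnet_feat + vit_feat

def softmax(logits):
    # Numerically stable softmax over class logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

fused = fuse([0.0] * RESNET_DIM, [0.0] * VIT_DIM)
print(len(fused))  # 2816

probs = softmax([2.0, 0.5, -1.0])  # hypothetical logits for the 3 classes
```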

  • Research Article
  • Cited by 4
  • 10.1155/2024/5818803
A New Method to Optimize Deep CNN Model for Classification of Regular Cucumber Based on Global Average Pooling
  • Jan 1, 2024
  • Journal of Food Processing and Preservation
  • Sajad Haseli Golzar + 2 more

Traditional methods of separating defective cucumbers are inherently labor‐intensive and time‐consuming. However, with the emergence of intelligent farming practices, deep learning (DL) algorithms, particularly in image processing and machine vision, have demonstrated significant potential to address this challenge. The main objective of this study is to develop a DL‐based algorithm capable of classifying cucumbers into three categories based on their visual characteristics: defective, curved, and sound (straight green). For this purpose, in addition to evaluating InceptionResNetV2 as a transfer learning method, a modified convolutional neural network (MCNN) incorporating global average pooling (GAP) was proposed to streamline the architecture and minimize the number of trainable parameters. The results show that the CNN with a GAP layer is more accurate than the CNN with a fully connected (FC) layer (FCL). The accuracies of the proposed CNN with GAP, the proposed CNN with FCL, and InceptionResNetV2 were 94.14%, 92.92%, and 91.21%, respectively, highlighting the efficiency of the CNN with GAP in cucumber classification and its potential to replace conventional grading methods. Overall, adding dropout did not improve the developed models; instead, the CNNs performed best with 64 neurons in the hidden layer.
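Why GAP shrinks the model is easy to quantify: a dense layer after flattening needs one weight per feature-map value per unit, whereas after GAP it needs one weight per channel per unit. The sketch counts first-hidden-layer parameters for the 64-neuron setting reported in the abstract; the 8×8×128 feature map is a hypothetical size, not one reported in the paper.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

def head_params(height, width, channels, hidden=64, use_gap=True):
    """Trainable parameters in the first hidden layer of the classifier head."""
    inputs = channels if use_gap else height * width * channels
    return dense_params(inputs, hidden)

# Hypothetical final feature map of 8 x 8 x 128.
print(head_params(8, 8, 128, use_gap=False))  # 524352 (flatten + FC)
print(head_params(8, 8, 128, use_gap=True))   # 8256   (GAP + FC)
```

The saving grows with the spatial size of the last feature map, since GAP removes the H × W factor entirely.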

  • Research Article
  • 10.22266/ijies2026.0228.50
Multichannel EEG based Hybrid Gated Separable Convolution and Transformer Encoder for Sleep Stage Epilepsy Prediction
  • Feb 28, 2026
  • International Journal of Intelligent Engineering and Systems

Sleep stage epilepsy represents a critical neurological disorder in which seizures occur during distinct phases of sleep, disrupting normal brain function and sleep architecture. Automated and reliable prediction of sleep-related epilepsy is therefore essential to support early diagnosis, long-term monitoring, and clinical decision-making. This study presents a novel hybrid Gated Separable Convolution Network (GSCN)-Transformer encoder model designed to predict sleep stage epilepsy from multichannel EEG recordings. The architecture integrates two complementary modules: the GSCN block extracts localized temporal features through depthwise separable convolutions and a gating mechanism, while the Transformer encoder captures long-range dependencies and inter-channel relationships using multi-head self-attention (MHSA). This integration ensures that both fine-grained temporal details and broader contextual dependencies are represented, resulting in a unified feature space optimized for classification. The proposed framework was evaluated on the Siena Sleep EEG dataset using a rigorously designed preprocessing pipeline and stratified train-test splitting, achieving a classification accuracy of 98.78%, with precision, recall, and F1-score of 98.79% each. Subject-wise cross-validation further confirmed the model's robustness, yielding a mean accuracy exceeding 98% with narrow confidence intervals. Statistical significance analysis using DeLong's and McNemar's tests demonstrated that the proposed model significantly outperforms conventional machine learning and deep learning baselines. Additionally, external benchmarking on the CHB-MIT Scalp EEG dataset without retraining achieved an accuracy of 95.2%, indicating strong cross-dataset generalization. The results confirm the framework's suitability for clinical decision support, continuous sleep monitoring, and integration into real-time healthcare applications for epilepsy management.
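Depthwise separable convolution factorises a standard convolution into a per-channel (depthwise) filter followed by a 1×1 (pointwise) channel mix, which cuts the weight count sharply. The layer sizes below are hypothetical. The gate is sketched as a GLU-style sigmoid multiplier, which is an assumption: the abstract does not specify the form of the GSCN gating mechanism.

```python
import math

def standard_conv1d_params(kernel, c_in, c_out):
    """Weights of a standard 1-D convolution (biases omitted)."""
    return kernel * c_in * c_out

def separable_conv1d_params(kernel, c_in, c_out):
    """Depthwise (one kernel per input channel) plus pointwise (1x1) weights."""
    return kernel * c_in + c_in * c_out

def gate(values, gate_logits):
    """Hypothetical GLU-style gate: scale each value by the sigmoid of its gate logit."""
    return [v / (1.0 + math.exp(-g)) for v, g in zip(values, gate_logits)]

# Hypothetical layer: 19 input channels, 64 filters, kernel size 7.
print(standard_conv1d_params(7, 19, 64))   # 8512
print(separable_conv1d_params(7, 19, 64))  # 1349
```

A gate logit near zero halves a feature, while a large positive logit lets it pass almost unchanged, so the network can learn which temporal features to suppress.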

  • Research Article
  • Cited by 12
  • 10.1155/2021/2251530
Multisource Data Fusion Diagnosis Method of Rolling Bearings Based on Improved Multiscale CNN
  • Jan 1, 2021
  • Journal of Sensors
  • Yulin Jin + 2 more

Intelligent diagnosis applies deep learning algorithms to mechanical fault diagnosis and can efficiently classify the fault types of machines or parts. At present, intelligent diagnosis of rolling bearings mostly relies on a single-sensor signal, whereas multisensor information can provide more comprehensive fault features for the deep learning model and improve its generalization ability. To use multisensor information more effectively, this paper proposes a multiscale convolutional neural network model based on global average pooling. The diagnostic model introduces multiscale convolution kernels in the feature extraction process, which improves the robustness of the model, and its parallel structure compensates for the shortcomings of the multichannel input fusion method. In the multiscale fusion process, global average pooling replaces the conventional step of reshaping the feature maps into a one-dimensional feature vector, which effectively retains the spatial structure of the feature maps. The proposed model has been validated on bearing fault data collected from an experimental platform. The experimental results show that the proposed algorithm can fuse multisensor data effectively. Compared with other data fusion algorithms, the GAP-based multiscale convolutional neural network requires fewer training epochs and achieves better fault diagnosis results.
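The benefit of GAP in a multiscale design is that branches with different kernel sizes produce outputs of different lengths, yet each pools to a fixed size and can be concatenated without reshaping. A single-channel, one-filter-per-scale sketch with hypothetical kernels of sizes 3, 5 and 7; the actual model learns many filters per scale across several sensors.

```python
def conv1d_valid(signal, kernel):
    """'Valid' 1-D cross-correlation: no padding, so the output shrinks with kernel size."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def multiscale_gap_features(signal, kernels):
    """Run one branch per kernel size, then global-average-pool each branch.

    Branch outputs have different lengths (len(signal) - k + 1), but GAP reduces
    each to a single number, so the branches can be concatenated directly.
    """
    features = []
    for kernel in kernels:
        response = conv1d_valid(signal, kernel)
        features.append(sum(response) / len(response))
    return features

# Hypothetical vibration snippet and kernels of sizes 3, 5 and 7.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
kernels = [[1.0, -1.0, 1.0], [0.2] * 5, [1.0] * 7]
feats = multiscale_gap_features(signal, kernels)  # one value per scale
```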

  • Research Article
  • Cited by 44
  • 10.1109/tgrs.2022.3159789
Unsupervised Spectral–Spatial Semantic Feature Learning for Hyperspectral Image Classification
  • Jan 1, 2022
  • IEEE Transactions on Geoscience and Remote Sensing
  • Huilin Xu + 3 more

Can we automatically learn meaningful semantic feature representations when training labels are absent? Several recent unsupervised deep learning approaches have attempted to tackle this problem by solving a data reconstruction task. However, these methods can easily latch onto low-level features. To solve this problem, we propose an end-to-end spectral–spatial semantic feature learning network (S3FN) for unsupervised deep semantic feature extraction (FE) from hyperspectral images (HSIs). Our main idea is to learn spectral–spatial features from a high-level semantic perspective. First, we use feature transformation to obtain two feature descriptions of the same source data from different views. Then, we propose a spectral–spatial feature learning network to project the two feature descriptions into a deep embedding space. Subsequently, a contrastive loss function is introduced to align the two projected features, which should share the same implied semantic meaning. The proposed S3FN learns the spectral and spatial features separately and then merges them. Finally, the spectral–spatial features learned by S3FN are passed to a classifier to evaluate their effectiveness. Experimental results on three publicly available HSI datasets show that the proposed S3FN produces promising classification results at a lower time cost than other state-of-the-art (SOTA) deep learning-based unsupervised FE methods.
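The abstract does not give the exact contrastive loss, so the sketch below uses a common InfoNCE-style formulation as a stand-in: each sample's first-view embedding should be most similar (by cosine similarity) to its own second-view embedding, with the other samples in the batch acting as negatives. The temperature and toy embeddings are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(view1, view2, temperature=0.5):
    """InfoNCE-style loss: row i of view1 should match row i of view2
    rather than any other row (the other rows act as negatives)."""
    n = len(view1)
    loss = 0.0
    for i in range(n):
        logits = [cosine(view1[i], view2[j]) / temperature for j in range(n)]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        loss += log_denominator - logits[i]  # cross-entropy with target i
    return loss / n

aligned = [[1.0, 0.0], [0.0, 1.0]]   # second view agrees with the first
swapped = [[0.0, 1.0], [1.0, 0.0]]   # second view matched to the wrong samples
print(info_nce(aligned, aligned) < info_nce(aligned, swapped))  # True
```

Minimising this loss pulls the two views of the same pixel or patch together in the embedding space, which is the alignment role the abstract assigns to its contrastive loss.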

  • Conference Article
  • Cited by 7
  • 10.1109/ictcs.2019.8923097
DeepDR: An image guided diabetic retinopathy detection technique using attention-based deep learning scheme
  • Oct 1, 2019
  • Noman Islam + 5 more

This paper proposes an efficient and cost-effective deep learning architecture to detect diabetic retinopathy in real time. Diabetes is a leading cause of eye disease. It damages the retinal blood vessels and can cause blood to leak from them. Early detection of diabetic retinopathy helps reduce the risk of blindness and other complications. In this paper, after preprocessing and data augmentation, Inception V3 is used as a pre-trained model to extract the initial feature set. A convolutional neural network with attention layers is then added; these additional CNN layers extract deeper features to improve classification performance and accuracy. The baseline model was originally proposed by Kevin Mader on Kaggle; this paper adds further layers to it and significantly improves validation and testing accuracy. More than 90% validation accuracy was achieved with the proposed convolutional neural network model, and testing accuracy improved by up to 5%. This improvement is significant because the dataset is imbalanced and contains noisy images. It is concluded that the global average pooling (GAP)-based attention mechanism increased the architecture's accuracy in detecting diabetic retinopathy in an imbalanced and noisy image dataset.

  • Research Article
  • Cited by 7
  • 10.1080/09720510.2019.1609554
Empirical evaluation of deep learning models for sentiment analysis
  • May 19, 2019
  • Journal of Statistics and Management Systems
  • Ajeet Ram Pathak + 2 more

The availability of computing resources and the generation of large-scale data from Artificial Intelligence, the Internet of Things and social media platforms have resulted in a resurgence of deep learning technology. Deep learning architectures have been successfully adopted to solve problems in a variety of domains such as computer vision, information retrieval, robotics, and natural language processing. Owing to the inherent ability of deep architectures to extract hierarchical structures from complex multimedia data, they have been widely used for classification, regression and prediction tasks. Motivated by this, the paper addresses the problem of identifying subjective information in text documents and predicting sentiment at the sentence level using a deep feedforward neural network with global average pooling and a long short-term memory model with dense layers. The experiments show that both models perform on par and provide good accuracy on the benchmark sentiment classification dataset.
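Global average pooling over a sentence amounts to averaging its word vectors, which gives the feedforward network a fixed-length input regardless of sentence length but discards word order (unlike the LSTM). A toy sketch with hypothetical two-dimensional embeddings and a single logistic unit standing in for the dense layers:

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the word vectors of a sentence (global average pooling over time).
    Words missing from the vocabulary are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def positive_probability(vec, weights, bias=0.0):
    """Single logistic unit standing in for the dense layers."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical embeddings: dimension 0 loosely encodes positive sentiment.
emb = {"great": [2.0, 0.1], "awful": [-2.0, 0.1], "movie": [0.0, 0.5], "the": [0.0, 0.0]}
w = [1.5, 0.0]
p_pos = positive_probability(sentence_vector(["the", "movie", "was", "great"], emb), w)
p_neg = positive_probability(sentence_vector(["the", "movie", "was", "awful"], emb), w)
```

Because the mean is order-invariant, "not good, bad" and "bad, not good" map to the same vector; capturing such order effects is what the LSTM variant adds.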

  • Research Article
  • Cited by 6
  • 10.3389/fcimb.2023.1116285
Computed tomography-based COVID-19 triage through a deep neural network using mask-weighted global average pooling.
  • Mar 3, 2023
  • Frontiers in Cellular and Infection Microbiology
  • Hong-Tao Zhang + 11 more

There is an urgent need for an effective and accurate method of triaging coronavirus disease 2019 (COVID-19) patients among millions or billions of people. Therefore, this study aimed to develop a novel deep-learning approach for COVID-19 triage based on chest computed tomography (CT) images, including normal, pneumonia, and COVID-19 cases. A total of 2,809 chest CT scans (1,105 COVID-19, 854 normal, and 850 non-COVID-19 pneumonia cases) were acquired for this study and divided into a training set (n = 2,329) and a test set (n = 480). A U-net-based convolutional neural network was used for lung segmentation, and a mask-weighted global average pooling (GAP) method was proposed for the deep neural network to improve the classification of COVID-19 versus normal or common pneumonia cases. Lung segmentation reached a Dice value of 96.5% on 30 independent CT scans. On the test set, the mask-weighted GAP method achieved COVID-19 triage with a sensitivity of 96.5% and a specificity of 87.8%, improvements of 0.9% and 2% in sensitivity and specificity, respectively, over standard GAP. In addition, fusion images combining the CT images with the regions highlighted by the deep learning model via the Grad-CAM method, indicating the lesion regions detected, were drawn and could be confirmed by radiologists. This study proposed a mask-weighted GAP-based deep learning method and obtained promising results for COVID-19 triage based on chest CT images. Furthermore, it can be considered a convenient tool to assist doctors in diagnosing COVID-19.
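Mask-weighted GAP can be read as averaging each feature channel only over the region selected by the segmentation mask, so strong responses outside the lungs do not distort the pooled value. The sketch assumes the common form sum(f·m) / sum(m) with a binary mask already resized to the feature-map resolution; the paper's exact weighting may differ, and the toy values are hypothetical.

```python
def gap(channel):
    """Plain global average pooling over an H x W channel."""
    return sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))

def mask_weighted_gap(channel, mask):
    """Average only over the masked (lung) pixels: sum(f * m) / sum(m)."""
    num = sum(f * m for frow, mrow in zip(channel, mask) for f, m in zip(frow, mrow))
    den = sum(m for mrow in mask for m in mrow)
    return num / den if den > 0 else 0.0

feature = [[4.0, 4.0, 0.0],
           [4.0, 4.0, 0.0],
           [0.0, 0.0, 8.0]]   # the 8.0 lies outside the lung region
lung_mask = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 0]]

print(gap(feature))                           # 24 / 9, about 2.67
print(mask_weighted_gap(feature, lung_mask))  # 16 / 4 = 4.0
```

Plain GAP mixes the out-of-lung activation into the average and dilutes the lung response with background zeros; the masked version reports only what happens inside the lungs.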

  • Research Article
  • Cited by 9
  • 10.1007/s44196-025-00848-x
A Novel Human Action Recognition Model by Grad-CAM Visualization with Multi-level Feature Extraction Using Global Average Pooling with Sequence Modeling by Bidirectional Gated Recurrent Units
  • May 15, 2025
  • International Journal of Computational Intelligence Systems
  • Jayamohan Manoharan + 1 more

Human action recognition is essential in many real-world scenarios, such as video surveillance, human–computer interaction, and behavior analysis. Despite the progress in deep learning, issues such as occlusion, background distraction, and motion pattern variability persist, restricting the generalization ability of current models. Most methods rely only on spatial or temporal features and cannot efficiently capture both in one framework, which lowers accuracy in realistic situations. In response to these shortcomings, a multilevel feature extraction approach was proposed that integrates spatial and temporal features to improve action recognition precision. The method captures RGB frames, optical flow, spatial saliency maps, and temporal saliency maps to enable a comprehensive inspection of video streams. Features were extracted with a pre-trained Inception V3 model and then passed to bidirectional gated recurrent units (Bi-GRUs) for sequential modeling. An attention mechanism was also included to boost classification by focusing on key temporal segments. The strategy was evaluated on the UCF101 and HMDB51 benchmark datasets. The model's accuracy was 98.13% on UCF101 and 81.45% on HMDB51, validating its ability to discriminate between heterogeneous human actions. These results confirm that the framework is an efficient and discriminative action recognition approach, suitable for applications requiring extensive motion analysis and real-time deployment.
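The Bi-GRU stage (also the sequence model in the SCNNBN-TBiG design at the top of this page) runs one GRU forward and another backward over the sequence and concatenates their states at each time step. A minimal plain-Python sketch with biases omitted and hypothetical weights; the candidate-state update applies the reset gate to the recurrent term after the matrix product, as in PyTorch's `nn.GRU`. The attention over temporal segments is not included.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gru_step(x, h, p):
    """One GRU update (biases omitted).

    p holds input weights Wz, Wr, Wn (H x D) and recurrent weights Uz, Ur, Un (H x H).
    """
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]  # reset gate
    uh = matvec(p["Un"], h)
    n = [math.tanh(a + ri * b) for a, ri, b in zip(matvec(p["Wn"], x), r, uh)]    # candidate
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def run_gru(seq, p, hidden):
    h = [0.0] * hidden
    outputs = []
    for x in seq:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs

def bigru(seq, p_fwd, p_bwd, hidden):
    """Concatenate forward states with time-aligned backward states."""
    fwd = run_gru(seq, p_fwd, hidden)
    bwd = run_gru(seq[::-1], p_bwd, hidden)[::-1]  # re-reverse so index t matches
    return [f + b for f, b in zip(fwd, bwd)]

def make_params(scale):
    # Hypothetical fixed weights; a trained model would learn these.
    W = [[scale, -scale], [0.5 * scale, scale]]
    U = [[0.1, 0.0], [0.0, 0.1]]
    return {"Wz": W, "Wr": W, "Wn": W, "Uz": U, "Ur": U, "Un": U}

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e.g. per-frame feature vectors
states = bigru(seq, make_params(0.5), make_params(-0.5), hidden=2)
```

Each output vector has 2 × hidden entries: the first half summarises the past up to frame t, the second half the future from frame t onward.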
