MFPIDet: improved YOLOV7 architecture based on multi-scale feature fusion for prohibited item detection in complex environment
Prohibited item detection is crucial for the safety of public places. Deep learning, one of the mainstream methods in prohibited item detection tasks, has shown performance far beyond that of traditional prohibited item detection methods. However, most neural network architectures in deep learning still lack sufficient local feature representation ability for overlapping and small targets, and ignore the semantic conflicts caused by direct feature fusion. In this paper, we propose MFPIDet, a novel prohibited item detection neural network architecture based on an improved YOLOV7, to achieve reliable prohibited item detection in complex environments. Specifically, a multi-scale attention module (MAM) backbone is proposed to filter the redundant information of target regions and to enhance the local feature representation ability of overlapping objects. Here, to reduce the redundant information of target regions, a squeeze-excitation (SE) block is used to filter the background. Then, to enhance the feature expression ability of overlapping objects, a multi-scale feature extraction module (MFEM) is designed for local feature representation. In addition, to obtain richer context information, we design an adaptive fusion feature pyramid network (AF-FPN) that combines the adaptive context information fusion module (ACIFM) with the feature fusion module (FFM) to improve the neck structure of YOLOV7. The proposed method is validated on the PIDray dataset, and the test results show that our method obtains the highest mAP (68.7%), an improvement of 3.5% over YOLOV7. Our approach provides a new design pattern for prohibited item detection in complex environments and shows the development potential of deep learning in related fields.
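The squeeze-excitation step mentioned in the abstract above can be illustrated with a minimal sketch. It follows the generic SE recipe (global average pooling, two fully connected layers, a sigmoid gate, channel-wise rescaling) and is not the MFPIDet implementation; the function name `se_block`, the weight matrices `w1` and `w2`, and the omission of biases are assumptions made for illustration.

```python
import math


def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation over a feature map given as [channel][row][col].

    w1 (reduced x channels) and w2 (channels x reduced) are hypothetical,
    already-trained fully connected weights; biases are omitted for brevity.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every channel by its gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates
```

Channels whose gate is close to zero are suppressed, which is one way a block of this kind can down-weight background responses.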
- Research Article
97
- 10.1007/s40747-023-01180-7
- Aug 11, 2023
- Complex & Intelligent Systems
Industrial defect detection is a hot topic in the field of computer vision. It is a challenging task due to the complex features and many categories of industrial defects. In this paper, a deep learning model based on a multiscale feature extraction module is introduced for steel surface defect detection. The main focus is on the feature extraction and feature fusion capabilities of the model, with the aim of improving its accuracy for steel surface defect detection. First, to improve the feature extraction ability of the model, a multiscale feature extraction (MSFE) module is introduced. The MSFE module can effectively extract multiscale features through three branches that have different convolution kernel sizes. Second, an efficient feature fusion (EFF) module is proposed to optimize feature fusion by adding features from the backbone network to the neck network. Third, this paper puts forward a new Bottleneck module by reducing the normalization layer and activation function in the original Bottleneck module. Finally, the backbone network is deepened to further enhance the feature extraction ability of the model. Extensive experiments are conducted on the public NEU-DET dataset. The experimental results validate the effectiveness of the designed modules and the proposed model. Compared with other state-of-the-art methods, the proposed model achieves the best accuracy (73.08% mAP@0.5) while maintaining a small number of parameters.
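The three-branch idea described above can be sketched in one dimension. This is a simplified illustration rather than the MSFE module itself: the kernel sizes (1, 3, 5), the averaging kernels standing in for learned weights, and the names `conv1d_same` and `multiscale_features` are all assumptions, since the abstract does not state them.

```python
def conv1d_same(signal, kernel):
    """1D correlation with zero padding so the output keeps the input length.

    Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def multiscale_features(signal, kernel_sizes=(1, 3, 5)):
    """Run one branch per kernel size and stack the outputs as channels."""
    branches = []
    for size in kernel_sizes:
        kernel = [1.0 / size] * size  # placeholder averaging kernel, not learned
        branches.append(conv1d_same(signal, kernel))
    return branches
```

Each branch sees a different neighbourhood width, so stacking the outputs gives later layers access to responses at several scales for the same position.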
- Conference Article
22
- 10.1109/iciea.2019.8834193
- Jun 1, 2019
Stereo matching plays an indispensable part in autonomous driving, robotics and 3D scene reconstruction. We propose a novel deep learning architecture called CFP-Net, a Cross-Form Pyramid stereo matching network for regressing disparity from a rectified pair of stereo images. The network consists of three modules: a Multi-Scale 2D local feature extraction module, a Cross-form spatial pyramid module and a Multi-Scale 3D Feature Matching and Fusion module. The Multi-Scale 2D local feature extraction module extracts sufficient multi-scale features. The Cross-form spatial pyramid module aggregates the context information at different scales and locations to form a cost volume. Moreover, it is shown to be more effective than SPP and ASPP in ill-posed regions. The Multi-Scale 3D feature matching and fusion module regularizes the cost volume using two parallel 3D deconvolution structures with two different receptive fields. Our proposed method has been evaluated on the Scene Flow and KITTI datasets. It achieves state-of-the-art performance on the KITTI 2012 and 2015 benchmarks.
- Conference Article
- 10.1109/isceic67854.2025.11405481
- Nov 24, 2025
Fire detection in complex environments faces multiple challenges, such as multi-scale targets, environmental interference and real-time constraints. Mainstream real-time object detectors struggle to address these issues simultaneously with their original structures. To overcome these limitations, this paper proposes a novel fire detection framework named FIRE-DEIM, which is based on the end-to-end detector DETR with Improved Matching (DEIM). The framework enhances flame features to achieve more accurate detection. Specifically, a two-stage dehazing mechanism based on the dark channel prior is designed to preprocess the input images, enhancing the visual characteristics of flames obscured by smoke. In addition, the original upsampling and subsequent feature fusion modules are improved to alleviate information loss in the transmission of feature maps, while spatial shift operations and coordinate attention are introduced to further strengthen feature representation. Finally, partial convolution is applied to specific feature extraction modules for lightweight optimization, significantly reducing the computational cost. Comparative experiments conducted on two synthetic datasets demonstrate that, compared with other popular real-time detectors, FIRE-DEIM achieves an excellent balance between accuracy and efficiency. Visualization analysis further highlights its advantages in detecting multi-scale flames and smoke-obscured targets. These experimental results collectively indicate that the proposed algorithm exhibits stronger robustness and is better suited for fire detection in complex environments.
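The dark channel that the dehazing step above builds on can be sketched as follows. This shows only the standard dark channel computation (per-pixel minimum over color channels, then a minimum over a local patch); the two-stage mechanism of FIRE-DEIM is not described in the abstract and is not reproduced here. The function name `dark_channel`, the nested-list image layout and the `patch_radius` parameter are assumptions.

```python
def dark_channel(image, patch_radius=1):
    """Dark channel of an image stored as image[row][col] = (r, g, b) in [0, 1]."""
    height, width = len(image), len(image[0])
    # Per-pixel minimum over the three color channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    dark = []
    for r in range(height):
        out_row = []
        for c in range(width):
            # Minimum over the square patch, clipped at the image border.
            rows = range(max(0, r - patch_radius), min(height, r + patch_radius + 1))
            cols = range(max(0, c - patch_radius), min(width, c + patch_radius + 1))
            out_row.append(min(min_rgb[i][j] for i in rows for j in cols))
        dark.append(out_row)
    return dark
```

In haze-free regions the dark channel tends to be close to zero, so large values hint at haze or smoke, which is the observation dehazing methods of this family exploit.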
- Research Article
1
- 10.1038/s41598-024-72523-8
- Sep 13, 2024
- Scientific Reports
To address the problem of dense crowd face detection in complex environments, this paper proposes a face detection model named Deep and Compact Face Detection (DCFD), which adopts an improved lightweight EfficientNetV2 network to replace the backbone network of RetinaFace. A large kernel attention mechanism is introduced to address the face detection task more accurately. In the backbone network, an improved efficient channel attention (ECA) mechanism is added to further improve the algorithm performance. The feature fusion module is an improved neural architecture search feature pyramid network (NAS-FPN) that significantly improves the face detection accuracy in different scenes. To balance the training process of positive and negative samples, we use the focal loss function to replace the traditional cross-entropy loss function. In different environments, the DCFD algorithm has shown efficient face detection performance. This algorithm provides not only a feasible and effective solution to the problem of face detection in dense groups but also an important basis for improving the accuracy of face detection models in practical applications.
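To make the loss substitution above concrete, here is a minimal sketch, assuming the loss named in the abstract is the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults alpha = 0.25 and gamma = 2.0 are commonly used values and are assumptions here, not settings reported for DCFD; both function names are hypothetical.

```python
import math


def binary_cross_entropy(p, y):
    """Cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss: cross-entropy down-weighted for well-classified examples."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy; larger gamma shrinks the contribution of easy examples, which are mostly the abundant negatives in detection.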
- Book Chapter
1
- 10.1007/978-3-642-39109-5_8
- Jan 1, 2013
During detection in complex biological systems, molecular beacons, like other fluorescence probes, suffer severely from background signal interference. Recent studies indicate that the excimer molecular beacon (EMB) can address this problem. EMB is a dual-pyrene-labeled hairpin DNA structure with a large Stokes shift and a long fluorescence lifetime, which affords an effective strategy for detection in complex biological environments. In this chapter, recent developments in EMB research are presented. First, the general design of EMB, as well as its structure and working mechanism, is introduced. Furthermore, the synthesis and properties of EMB are described, which explain its capability for detection in complex environments and its high sensitivity and selectivity. Finally, examples of different applications are discussed, including the detection of nucleic acids and other molecules.
- Conference Article
20
- 10.1109/cvpr.2003.1211383
- Jun 18, 2003
S-AdaBoost is a new variant of AdaBoost and is more effective than the conventional AdaBoost in handling outliers in pattern detection and classification in real-world complex environments. Utilizing the divide and conquer principle, S-AdaBoost divides the input space into a few sub-spaces and uses dedicated classifiers to classify patterns in the sub-spaces. The final classification result is the combination of the outputs of the dedicated classifiers. The S-AdaBoost system is made up of an AdaBoost divider, an AdaBoost classifier, a dedicated classifier for outliers, and a non-linear combiner. In addition to presenting face detection test results in a complex airport environment, we have also conducted experiments on a number of benchmark databases to test the algorithm. The experimental results clearly show S-AdaBoost's effectiveness in pattern detection and classification.
- Conference Article
10
- 10.1109/igarss47720.2021.9553680
- Jul 11, 2021
Semantic segmentation of high-resolution remote sensing (HRRS) images has been a long-term research topic in the field of remote sensing. Nowadays, many excellent networks based on deep learning have been applied in various remote sensing fields. However, these networks always have a large number of network parameters and rely on extensive computing resources. To solve the above problems, we propose a lightweight dual-branch network with attention modules and a feature fusion module. The backbone networks of the two branches, which have fewer parameters, are used to obtain the detail information and the context information respectively. The attention modules are used to establish full-image dependencies over the local feature representations. The feature fusion module is used to fuse the low-level features and the high-level features effectively. Compared to other popular networks, our network achieves better results on the ISPRS Vaihingen Dataset with fewer parameters (8M).
- Book Chapter
5
- 10.1007/978-3-540-30212-4_12
- Jan 1, 2004
In this paper, a framework is proposed for foreground detection in various complex environments. This method integrates the detection and tracking procedures into a unified probability framework by considering the spatial, spectral and temporal information of pixels to model different complex backgrounds. Firstly, a Bayesian framework, which combines the prior distribution of the pixel’s features and the likelihood probability with a homogeneous region-based background model, is introduced to classify pixels into foreground and background. Secondly, an updating scheme, which includes an on-line learning process of the prior probability and background model updating, is employed to guarantee the accuracy of accumulated statistical knowledge of pixels over time when environmental conditions change. By minimizing the difference between the prior and the posterior distribution of pixels within a short-term temporal frame buffer, a recursive on-line prior probability learning scheme enables the system to rapidly converge to the new equilibrium condition in response to gradual environmental changes. This framework is demonstrated in a variety of environments including swimming pools, shopping malls, offices and campuses. Compared with existing methods, the proposed methodology is more robust and efficient.
- Research Article
- 10.14419/9x67sq37
- Dec 15, 2025
- International Journal of Basic and Applied Sciences
Lane detection in complex driving environments remains challenging due to issues such as poor visibility, occlusions, and intricate lane topologies. To address these challenges, this paper proposes MPDDNet, a novel multi-scale parallel network that integrates two key components: a Direction-aware Adaptive Multi-scale Feature Fusion (DAM-FF) module and a Deformable Non-local (DF-NL) module. The DAM-FF module explicitly embeds directional priors into multi-scale feature fusion through adaptive weighting and direction-aware spatial attention, significantly enhancing detailed feature representation in challenging scenarios. The DF-NL module combines multi-scale feature fusion with deformable attention mechanisms, enabling efficient global context modeling while implicitly incorporating structural priors of lane geometry. Through parallel integration, these modules achieve synergistic optimization of local details and global semantics. Extensive experiments on three benchmarks demonstrate that MPDDNet establishes new state-of-the-art performance, achieving 83.03% F1 score and 63.54% mF1 score on the CULane dataset. Our method also achieves remarkable results on the LLAMAS and TuSimple benchmarks, with 97.80% F1@50 and 98.40% F1 score, respectively. The consistent superiority across all three datasets and various challenging scenarios validates our approach's robustness and generalization capability, providing an effective solution for lane detection in complex environments.
- Research Article
1
- 10.21037/qims-24-683
- Oct 1, 2024
- Quantitative imaging in medicine and surgery
Patients with multiple myeloma (MM), a malignant disease involving bone marrow plasma cells, show significant susceptibility to bone degradation, impairing normal hematopoietic function. The accurate and effective segmentation of MM lesion areas is crucial for the early detection and diagnosis of myeloma. However, the presence of complex shape variations, boundary ambiguities, and multiscale lesion areas, ranging from punctate lesions to extensive bone damage, presents a formidable challenge in achieving precise segmentation. This study thus aimed to develop a more accurate and robust segmentation method for MM lesions by extracting rich multiscale features. In this paper, we propose a novel, multiscale feature fusion encoding-decoding model architecture specifically designed for MM segmentation. In the encoding stage, our proposed multiscale feature extraction module, dilated dense connected net (DCNet), is employed to systematically extract multiscale features, thereby augmenting the model's sensing field. In the decoding stage, we propose the CBAM-atrous spatial pyramid pooling (CASPP) module to enhance the extraction of multiscale features, enabling the model to dynamically prioritize both channel and spatial information. Subsequently, these features are concatenated with the final output feature map to optimize segmentation outcomes. At the feature fusion bottleneck layer, we incorporate the dynamic feature fusion (DyCat) module into the skip connection to dynamically adjust feature extraction parameters and fusion processes. We assessed the efficacy of our approach using a proprietary dataset of MM, yielding notable advancements. Our dataset comprised 753 magnetic resonance imaging (MRI) two-dimensional (2D) slice images of the spinal regions from 45 patients with MM, along with their corresponding ground truth labels.
These images were primarily obtained from three sequences: T1-weighted imaging (T1WI), T2-weighted imaging (T2WI), and short tau inversion recovery (STIR). Using image augmentation techniques, we expanded the dataset to 3,000 images, which were employed for both model training and prediction. Among these, 2,400 images were allocated for training purposes, while 600 images were reserved for validation and testing. Our method showed an increase in the intersection over union (IoU) and Dice coefficients of 7.9 and 6.7 percentage points, respectively, compared to the baseline model. Furthermore, we performed comparisons with alternative image segmentation methodologies, which confirmed the sophistication and efficacy of our proposed model. Our proposed multiple myeloma segmentation net (MMNet) can effectively extract multiscale features from images and enhance the correlation between channel and spatial information. Furthermore, a systematic evaluation of the proposed network architecture was conducted on a self-constructed, limited dataset. This endeavor holds promise for offering valuable insights into the development of algorithms for future clinical applications.
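The two segmentation metrics reported above have standard definitions for binary masks: IoU = |A ∩ B| / |A ∪ B| and Dice = 2|A ∩ B| / (|A| + |B|). The sketch below computes both; the function name `iou_and_dice`, the nested-list mask layout and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details from the study.

```python
def iou_and_dice(pred, truth):
    """IoU and Dice for two binary masks given as equally sized nested lists."""
    pred_flat = [bool(v) for row in pred for v in row]
    truth_flat = [bool(v) for row in truth for v in row]
    intersection = sum(p and t for p, t in zip(pred_flat, truth_flat))
    pred_area = sum(pred_flat)
    truth_area = sum(truth_flat)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty: treated as perfect agreement (a convention).
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + truth_area)
    return iou, dice
```

For a single pair of masks the two metrics are tied by Dice = 2 * IoU / (1 + IoU), so Dice is never lower than IoU.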
- Research Article
- 10.1049/cvi2.70032
- Jan 1, 2025
- IET Computer Vision
In recent years, the development of deep learning algorithms has significantly advanced the application of synthetic aperture radar (SAR) aircraft detection in remote sensing and military fields. However, existing methods face a dual dilemma: CNN‐based models suffer from insufficient detection accuracy due to limitations in local receptive fields, whereas Transformer‐based models improve accuracy by leveraging attention mechanisms but incur significant computational overhead due to their quadratic complexity. This imbalance between accuracy and efficiency severely limits the development of SAR aircraft detection. To address this problem, this paper proposes a novel neural network based on state space models (SSM), termed the Mamba SAR detection network (MSAD). Specifically, we design a feature encoding module, MEBlock, that integrates CNN with SSM to enhance global feature modelling capabilities. Meanwhile, the linear computational complexity brought by SSM is superior to that of Transformer architectures, achieving a reduction in computational overhead. Additionally, we propose a context‐aware feature fusion module (CAFF) that combines attention mechanisms to achieve adaptive fusion of multi‐scale features. Lastly, a lightweight parameter‐shared detection head (PSHead) is utilised to effectively reduce redundant parameters through implicit feature interaction. Experiments on the SAR‐AirCraft‐v1.0 and SADD datasets show that MSAD achieves higher accuracy than existing algorithms, whereas its GFLOPs are 2.7 times smaller than those of the Transformer architecture RT‐DETR. These results validate the core role of SSM as an accuracy‐efficiency balancer, reflecting MSAD's perceptual capability and performance in SAR aircraft detection in complex environments.
- Research Article
- 10.3390/electronics11233895
- Nov 25, 2022
- Electronics
License plate detection plays a significant role in intelligent transportation systems. Convolutional neural networks have shown a remarkable performance and made significant progress for the detection task. Despite these outstanding achievements, license plate detection in complex environments is still a challenging task, due to the noisy background, unpredictable environments and varying shapes and sizes of the license plates. In order to improve the performance of license plate detection in complex environments, we propose a novel approach using an attention-guided dual feature pyramid and a cascaded positioning strategy. At first, the original features of images are extracted by the residual network. In order to make sure that each feature map contains higher- and lower-level semantic information, we utilize a bottom-up and a top-down pathway, respectively. Meanwhile, the proposed attention-guided dual feature pyramid network is used to receive the extracted features for a multilevel feature fusion. Our proposed attention-guided modules contain both spatial and channel attention. Attention-guided modules deduce the attention weights according to channel and spatial dimensions and multiply the calculated result with the input to obtain the refined feature maps. Then, a region proposal network is used to generate the candidate regions for the license plates. Finally, a cascaded positioning network is utilized to obtain the final locations of the license plates. To validate the effectiveness of the proposed method, we conducted a series of experiments on different public datasets. Experiments on AOLP and CCPD validated the effectiveness of our proposed method.
- Research Article
- 10.1016/j.medengphy.2025.104442
- Sep 19, 2025
- Medical engineering & physics
PSCT-Net: A parallel symmetric CNN-transformer hybrid network for medical image segmentation.
- Conference Article
11
- 10.1109/iaeac50856.2021.9391019
- Mar 12, 2021
With the development of deep learning and synthetic aperture radar (SAR) technology, SAR image target detection based on convolutional neural networks (CNNs) has achieved certain results. However, there are still problems in SAR detection of near-shore ship targets in complex environments. To improve the detection rate of near-shore ship targets in SAR images in complex environments, this paper proposes an algorithm for SAR image ship target detection. The algorithm first uses a convolutional neural network for coastal segmentation and then performs SAR image ship target detection based on the results of the coastal segmentation. The experimental results show that the algorithm detects near-shore ship targets in SAR images efficiently in complex environments.
- Research Article
6
- 10.1155/2022/6205108
- Jan 5, 2022
- Computational Intelligence and Neuroscience
Existing face detection methods are affected by the network model structure used. Most face recognition methods have a low recognition rate for face key point features because of their many parameters and large amount of computation. In order to improve the recognition accuracy and detection speed of face key points, a real-time face key point detection algorithm based on an attention mechanism is proposed in this paper. Given the multiscale characteristics of face key point features, a deep convolutional network model is adopted: an attention module is added to the VGG network structure, a feature enhancement module and a feature fusion module are combined to improve the shallow feature representation ability of VGG, and a cascade attention mechanism is used to improve the deep feature representation ability. Experiments show that the proposed algorithm not only can effectively realize face key point recognition but also has better recognition accuracy and detection speed than other similar methods. This method can provide some theoretical basis and technical support for face detection in complex environments.