Enhancing medical text detection with vision-language pre-training and efficient segmentation

Tianyang Li, Qingzhu Wang, Jinxu Bai

doi:10.1007/s40747-024-01378-3

Abstract

Detecting text within medical images is a formidable challenge in computer vision because of the intricate backgrounds surrounding the text, the dense concentration of text, and the possible presence of extreme aspect ratios. This paper introduces an effective and precise text detection system tailored to these challenges. The system combines an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three components: a Feature Pyramid Network (FPN) module that combines a residual structure with a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE) module, which performs multi-scale feature fusion based on RSEConv. In the FPN module, the convolutional layers are replaced with RSEConv layers, which add a residual structure and a channel attention mechanism and thereby augment the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby enhancing segmentation performance. The MSFM-RSE then fuses features from the various depths and scales of the EFEM to generate the final features used for segmentation. Additionally, a post-processing module employs a differentiable binarization strategy, allowing the segmentation network to determine the binarization threshold dynamically. Building on these improvements, we introduce a vision-language pre-training model that is trained extensively on various visual language understanding tasks. This pre-trained model acquires detailed visual and semantic representations, which further reinforce both the accuracy and the robustness of text detection when integrated with the segmentation module. We evaluated the proposed model on medical text image datasets, where it achieved excellent results, and multiple benchmark experiments validate its superior performance in comparison with existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
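The differentiable binarization step mentioned in the abstract can be illustrated with a short sketch. It assumes the formulation commonly associated with DBNet-style detectors, B = 1 / (1 + exp(-k (P - T))), where P is the predicted probability map, T is the learned threshold map, and k is an amplification factor. The abstract does not state the exact formula or the value of k used in this paper, so the function names, the toy maps, and k = 50 are illustrative assumptions, not the authors' implementation.

```python
import math


def differentiable_binarize(prob, thresh, k=50.0):
    """Smooth stand-in for the hard step (prob > thresh).

    Computes 1 / (1 + exp(-k * (prob - thresh))). The value k = 50 is an
    assumed default, not a setting confirmed by the abstract.
    """
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


def binarize_map(prob_map, thresh_map, k=50.0):
    """Apply the smooth binarization pixel by pixel to two equally sized maps."""
    return [
        [differentiable_binarize(p, t, k) for p, t in zip(prob_row, thresh_row)]
        for prob_row, thresh_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x2 probability and threshold maps (values in [0, 1]).
prob_map = [[0.9, 0.2], [0.55, 0.45]]
thresh_map = [[0.3, 0.3], [0.5, 0.5]]

approx_binary = binarize_map(prob_map, thresh_map)
```

Because the function is smooth in both `prob` and `thresh`, gradients can flow to the threshold map during training, which is what lets the network learn a per-pixel threshold instead of relying on a fixed one.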

Journal: Complex & Intelligent Systems
Publication Date: Feb 29, 2024
Citations: 2
License type: CC BY 4.0

Similar Papers

Sound source localization based on residual network and channel attention module
Fucai Hu ... Ruhan He
Scientific Reports | VOL. 13
03 Apr 2023

An Effective Network Integrating Residual Learning and Channel Attention Mechanism for Thin Cloud Removal
Xue Wen ... Yuxin Hu
IEEE Geoscience and Remote Sensing Letters | VOL. 19
01 Jan 2021

Siamese Object Tracking Algorithm Combining Residual Connection and Channel Attention Mechanism
Jiangnan Shao ... Hongwei Ge
Journal of Computer-Aided Design & Computer Graphics | VOL. 33
01 Feb 2021

Gross Tumor Volume Segmentation for Stage III NSCLC Radiotherapy Using 3D ResSE-Unet.
Xinhao Yu ... Yongzhong Wu
Technology in Cancer Research & Treatment | VOL. 21
01 Jan 2021
