Handwritten Annotation Spotting in Printed Documents Using Top-Down Visual Saliency Models

Abstract

In this article, we address the problem of localizing text and symbolic annotations on the scanned image of a printed document. Previous approaches have treated annotation extraction as binary classification into printed and handwritten text. In this work, we further subcategorize the annotations as underlines, encirclements, inline text, and marginal text. We have collected a new dataset of 300 documents containing all classes of annotations marked around or in between printed text. Using the dataset as a benchmark, we report the results of two saliency formulations, CRF Saliency and Discriminant Saliency, for predicting salient patches that can correspond to different types of annotations. We also compare our work with recent semantic segmentation techniques using deep models. Our analysis shows that Discriminant Saliency is the preferred approach for fast localization of patches containing different types of annotations. The saliency models were learned on a small dataset, yet still give performance comparable to deep networks for pixel-level semantic segmentation. We show that saliency-based methods give better outcomes with limited annotated data than more sophisticated segmentation techniques that require a large training set to learn the model.

Similar Papers
  • Conference Article
  • Citations: 141
  • 10.1109/icdar.2017.50
Multi-Scale Multi-Task FCN for Semantic Page Segmentation and Table Detection
  • Nov 1, 2017
  • Dafang He + 4 more

Page segmentation and table detection play an important role in understanding the structure of documents. We present a page segmentation algorithm that incorporates state-of-the-art deep learning methods for segmenting three types of document elements: text blocks, tables, and figures. We propose a multi-scale, multi-task fully convolutional neural network (FCN) for the tasks of semantic page segmentation and element contour detection. The semantic segmentation network accurately predicts, at each pixel, the probability of the three element classes. The contour detection network accurately predicts instance-level "edges" around each element occurrence. We propose a conditional random field (CRF) that uses features output by the semantic segmentation and contour networks to improve upon the semantic segmentation network output. Given the semantic segmentation output, we also extract individual table instances from the page using heuristic rules and a verification network to remove false positives. We show that although we take only a page image as input, we produce results comparable to other methods that rely on PDF file information, heuristics, and hand-crafted features tailored to specific types of documents. Our approach learns representative features for page segmentation from real and synthetic training data. This learning-based property makes it more general than existing methods in terms of document types and element appearances. For example, our method reliably detects sparsely lined tables, which are hard for rule-based or heuristic methods.

  • Conference Article
  • Citations: 709
  • 10.1109/cvpr.2018.00733
Weakly-Supervised Semantic Segmentation Network with Deep Seeded Region Growing
  • Jun 1, 2018
  • Zilong Huang + 4 more

This paper studies the problem of learning image semantic segmentation networks using only image-level labels as supervision, which is important since it can significantly reduce human annotation effort. Recent state-of-the-art methods for this problem first infer sparse, discriminative regions for each object class using a deep classification network, then train a semantic segmentation network using those discriminative regions as supervision. Inspired by the traditional image segmentation method of seeded region growing, we propose to train a semantic segmentation network starting from the discriminative regions and to progressively increase the pixel-level supervision by seeded region growing. The seeded region growing module is integrated into a deep segmentation network and can benefit from deep features. Unlike conventional deep networks with fixed/static labels, the proposed weakly-supervised network generates new labels using the contextual information within an image. The proposed method significantly outperforms weakly-supervised semantic segmentation methods that use static labels and obtains state-of-the-art performance: 63.2% mIoU on the PASCAL VOC 2012 test set and 26.0% mIoU on the COCO dataset.
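As a rough illustration of the classical seeded region growing idea this paper builds on (the paper's module replaces the simple intensity test below with deep features and integrates it into a segmentation network), here is a minimal sketch; the function name, threshold rule, and 4-neighborhood are our own choices:

```python
import numpy as np
from collections import deque

def seeded_region_growing(image, seeds, threshold):
    """Grow labeled regions outward from seed pixels via breadth-first search.

    image: 2D float array (e.g. a per-pixel score map).
    seeds: dict mapping label (nonzero int) -> list of (row, col) seeds.
    threshold: max absolute intensity difference for a pixel to join
               the region of the neighbor that reached it.
    Returns a 2D int array of labels (0 = unlabeled).
    """
    labels = np.zeros(image.shape, dtype=int)
    queue = deque()
    for label, coords in seeds.items():
        for (r, c) in coords:
            labels[r, c] = label
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbors of the current frontier pixel.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                    and labels[nr, nc] == 0
                    and abs(image[nr, nc] - image[r, c]) <= threshold):
                labels[nr, nc] = labels[r, c]
                queue.append((nr, nc))
    return labels
```

On a two-region score map with one seed per region, the grown labels partition the image along the intensity boundary.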

  • Research Article
  • Citations: 1
  • 10.1038/s41598-025-22560-8
Hierarchical attention mechanism combined with deep neural networks for accurate semantic segmentation of dental structures in panoramic radiographs
  • Nov 5, 2025
  • Scientific Reports
  • Mehrdad Esmaeili + 6 more

Computer vision, a rapidly advancing branch of artificial intelligence (AI), has gained significant attention in medical and dental applications. Semantic segmentation, a key technique within computer vision, enables the precise identification and delineation of objects at the pixel level, offering transformative potential for diagnostic imaging in dentistry. Panoramic radiographs are essential for diagnosing oral and maxillofacial conditions, yet their interpretation remains time-consuming and prone to human error, particularly in complex cases. This study evaluates the performance of a deep learning-based semantic segmentation model designed to identify and classify 24 distinct anatomical and pathological structures in panoramic radiographs. A dataset of 844 annotated panoramic images was collected from multiple radiography centers and used for training and testing. The model employs a hierarchical multi-scale attention mechanism to enhance accuracy by analyzing images at varying resolutions. Performance was assessed using key metrics, including specificity, accuracy, precision, recall, F1 score, and Intersection over Union (IoU). The proposed model demonstrated robust performance, achieving an overall accuracy of 98.73%, specificity of 98.86%, IoU value of 78.76%, precision of 86.97%, recall of 86.97%, and an F1 score of 84.54%. Notably, structures such as implants and amalgam restorations were identified with high reliability, while challenges persisted in detecting dental pulp and caries due to overlapping structures and subtle anatomical details. The deep neural network developed in this study exhibits significant potential for aiding dental professionals in accurately segmenting and identifying anatomical features in panoramic radiographs. While limitations exist in detecting specific intricate structures, the model's performance underscores the value of AI-driven tools in enhancing diagnostic accuracy and treatment planning in dentistry. Future work may explore complementary imaging modalities to address the remaining challenges.
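The evaluation metrics this abstract reports (specificity, accuracy, precision, recall, F1, IoU) all derive from the per-class confusion counts. A minimal sketch for a single binary class mask, with our own function name (the paper averages such per-class values over 24 classes):

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Overlap metrics for one class, given boolean prediction/target masks."""
    tp = np.logical_and(pred, target).sum()    # true positives
    fp = np.logical_and(pred, ~target).sum()   # false positives
    fn = np.logical_and(~pred, target).sum()   # false negatives
    tn = np.logical_and(~pred, ~target).sum()  # true negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "iou": tp / (tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```

Note that with precision = recall, F1 equals both of them by definition, which is a quick sanity check when reading reported figures.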

  • Research Article
  • Citations: 43
  • 10.1016/j.measurement.2023.113084
Improving RGB-D SLAM accuracy in dynamic environments based on semantic and geometric constraints
  • May 26, 2023
  • Measurement
  • Xiqi Wang + 3 more


  • Research Article
  • Citations: 4
  • 10.2139/ssrn.4231956
A Unified Architecture of Semantic Segmentation and Hierarchical Generative Adversarial Networks for Expression Manipulation
  • Jan 1, 2022
  • SSRN Electronic Journal
  • Rumeysa Bodur + 2 more

Editing facial expressions by changing only what we want is a long-standing research problem in Generative Adversarial Networks (GANs) for image manipulation. Most existing methods that rely only on a global generator tend to change unwanted attributes along with the target attributes. Recently, hierarchical networks that combine a global network handling the whole image with multiple local networks focusing on local parts have shown success. However, these methods extract local regions using bounding boxes centred around sparse facial key points, which are non-differentiable, inaccurate, and unrealistic. Hence, the solution becomes sub-optimal and introduces unwanted artefacts that degrade the overall quality of the synthetic images. Moreover, a recent study has shown a strong correlation between facial attributes and local semantic regions. To exploit this relationship, we designed a unified architecture of semantic segmentation and hierarchical GANs. A unique advantage of our framework is that on the forward pass the semantic segmentation network conditions the generative model, and on the backward pass gradients from the hierarchical GANs are propagated to the semantic segmentation network, making our framework an end-to-end differentiable architecture. This allows both architectures to benefit from each other. To demonstrate its advantages, we evaluate our method on two challenging facial expression translation benchmarks, AffectNet and RaFD, and a semantic segmentation benchmark, CelebAMask-HQ, across two popular architectures, BiSeNet and UNet. Our extensive quantitative and qualitative evaluations on both face semantic segmentation and face expression manipulation tasks validate the effectiveness of our work over existing state-of-the-art methods.

  • Research Article
  • Citations: 9
  • 10.1016/j.engappai.2021.104587
Interweave features of Deep Convolutional Neural Networks for semantic segmentation
  • Dec 20, 2021
  • Engineering Applications of Artificial Intelligence
  • Shuang Bai + 2 more


  • Research Article
  • Citations: 10
  • 10.3390/s22010337
DTS-Net: Depth-to-Space Networks for Fast and Accurate Semantic Object Segmentation
  • Jan 3, 2022
  • Sensors (Basel, Switzerland)
  • Hatem Ibrahem + 2 more

We propose Depth-to-Space Net (DTS-Net), an effective technique for semantic segmentation using the efficient sub-pixel convolutional neural network. The technique is inspired by depth-to-space (DTS) image reconstruction, originally used for image and video super-resolution, combined with a mask enhancement filtration technique based on multi-label classification, namely Nearest Label Filtration. In the proposed technique, we employ depth-wise separable convolution-based architectures. We propose both a deep network, DTS-Net, and a lightweight network, DTS-Net-Lite, for real-time semantic segmentation; these networks employ Xception and MobileNetV2 architectures as the feature extractors, respectively. In addition, we explore the joint semantic segmentation and depth estimation task and demonstrate that the proposed technique can efficiently perform both tasks simultaneously, outperforming state-of-the-art (SOTA) methods. We train and evaluate the proposed method on the PASCAL VOC2012, NYUV2, and CITYSCAPES benchmarks, obtaining high mean intersection over union (mIOU) and mean pixel accuracy (Pix.acc.) values with the simple and lightweight convolutional architectures of the developed networks. Notably, the proposed method outperforms SOTA methods that depend on encoder-decoder architectures, although our implementation and computations are far simpler.
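The depth-to-space rearrangement at the heart of DTS-Net is the standard sub-pixel (pixel shuffle) operation from super-resolution: each group of r*r channels is scattered into an r x r spatial block. A minimal single-image NumPy sketch of that operation (ours, not the authors' code):

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).

    Follows the usual pixel-shuffle layout:
    out[c, h*r + i, w*r + j] == x[c*r*r + i*r + j, h, w].
    """
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)     # split channels into (c, i, j) groups
    x = x.transpose(0, 3, 1, 4, 2)   # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)
```

This upsamples by a factor of r in each spatial dimension purely by reindexing, with no arithmetic, which is why sub-pixel heads are cheap compared to transposed convolutions.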

  • Research Article
  • Citations: 13
  • 10.3390/app12157811
Real-Time Semantic Understanding and Segmentation of Urban Scenes for Vehicle Visual Sensors by Optimized DCNN Algorithm
  • Aug 3, 2022
  • Applied Sciences
  • Yanyi Li + 2 more

The modern urban environment is becoming more and more complex. In helping us identify surrounding objects, vehicle vision sensors rely increasingly on the semantic segmentation ability of deep learning networks. The performance of a semantic segmentation network is essential: it directly affects the overall level of driving assistance technology in road environment perception. However, existing semantic segmentation networks have redundant structures, many parameters, and low operational efficiency. Therefore, to reduce the complexity of the network and the number of parameters and thereby improve efficiency, a method for efficient image semantic segmentation using a Deep Convolutional Neural Network (DCNN), based on deep learning (DL) theory, is studied in depth. First, the theoretical basis of the convolutional neural network (CNN) is briefly introduced, and real-time semantic segmentation of urban scenes based on DCNNs is described in detail. Second, the atrous convolution algorithm and the multi-scale parallel atrous spatial pyramid model are introduced. On this basis, an Efficient Symmetric Network (ESNet), a real-time semantic segmentation model for autonomous driving scenarios, is proposed. The experimental results show that: (1) On the Cityscapes dataset, the ESNet structure achieves 70.7% segmentation accuracy for the 19 semantic categories and 87.4% for the seven large grouping categories, improving on other algorithms to varying degrees. (2) On the CamVid dataset, compared with multiple lightweight real-time segmentation networks, the ESNet model has around 1.2 M parameters, a peak frame rate of around 90 FPS, and a peak mIOU of around 70%. Across seven semantic categories, the segmentation accuracy of the ESNet model is the highest, at around 98%. From this, we found that ESNet significantly improves segmentation accuracy while maintaining fast forward inference. Overall, the research not only provides technical support for the development of real-time semantic understanding and segmentation with DCNN algorithms but also contributes to the development of artificial intelligence technology.
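The atrous (dilated) convolution mentioned above spaces kernel taps d pixels apart, enlarging the receptive field without adding parameters; atrous spatial pyramid pooling runs several such convolutions with different rates in parallel. A minimal single-channel "valid" correlation sketch (our own helper, not the ESNet implementation):

```python
import numpy as np

def dilated_corr2d(x, k, d):
    """'Valid' 2-D cross-correlation with dilation rate d.

    Kernel taps are sampled d pixels apart, so a kh x kw kernel covers
    an effective extent of (kh-1)*d + 1 pixels per axis.
    """
    kh, kw = k.shape
    eh, ew = (kh - 1) * d + 1, (kw - 1) * d + 1  # effective kernel extent
    H, W = x.shape
    out = np.empty((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slice picks every d-th pixel within the window.
            patch = x[i:i + eh:d, j:j + ew:d]
            out[i, j] = np.sum(patch * k)
    return out
```

With d = 1 this reduces to ordinary correlation; larger d sees a wider context at the same parameter count, which is the trade-off ASPP exploits.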

  • Research Article
  • Citations: 15
  • 10.1016/j.sysarc.2024.103242
Evaluating single event upsets in deep neural networks for semantic segmentation: An embedded system perspective
  • Jul 20, 2024
  • Journal of Systems Architecture
  • Jon Gutiérrez-Zaballa + 2 more


  • Conference Article
  • Citations: 57
  • 10.1109/avss.2018.8639077
Evaluating deep semantic segmentation networks for object detection in maritime surveillance
  • Nov 1, 2018
  • Tom Cane + 1 more

Maritime surveillance is important for applications in safety and security, but the visual detection of objects in maritime scenes remains challenging due to the diverse and unconstrained nature of such environments, and the need to operate in near real-time. Recent work on deep neural networks for semantic segmentation has achieved good performance in the road/urban scene parsing task. Driven by the potential application in autonomous vehicle navigation, many of the architectures are designed to be fast and lightweight. In this paper, we evaluate semantic segmentation networks in the context of an object detection system for maritime surveillance. Using data from the ADE20k scene parsing dataset, we train a selection of recent semantic segmentation network architectures to compare their performance on a number of publicly available maritime surveillance datasets.

  • Research Article
  • Citations: 27
  • 10.3390/rs13142723
Semantic Segmentation of Satellite Images: A Deep Learning Approach Integrated with Geospatial Hash Codes
  • Jul 11, 2021
  • Remote Sensing
  • Naisen Yang + 1 more

Satellite images are always partitioned into regular patches with smaller sizes and then individually fed into deep neural networks (DNNs) for semantic segmentation. The underlying assumption is that these images are independent of one another in terms of geographic spatial information. However, it is well known that many land-cover or land-use categories share common regional characteristics within a certain spatial scale. For example, the style of buildings may change from one city or country to another. In this paper, we explore some deep learning approaches integrated with geospatial hash codes to improve the semantic segmentation results of satellite images. Specifically, the geographic coordinates of satellite images are encoded into a string of binary codes using the geohash method. Then, the binary codes of the geographic coordinates are fed into the deep neural network using three different methods in order to enhance the semantic segmentation ability of the deep neural network for satellite images. Experiments on three datasets demonstrate the effectiveness of embedding geographic coordinates into the neural networks. Our method yields a significant improvement over previous methods that do not use geospatial information.
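The geohash step described above bisects the longitude and latitude ranges alternately and records one bit per bisection, producing the binary code that is fed to the network. A simplified sketch of that encoding (bit string only, omitting full geohash's base-32 alphabet; the function name is ours):

```python
def coord_to_bits(lat, lon, n_bits=16):
    """Encode (lat, lon) as interleaved bisection bits, longitude first,
    in the style of geohash. Each bit halves one coordinate's interval."""
    bits = []
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    for i in range(n_bits):
        if i % 2 == 0:  # even positions refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append("1"); lon_lo = mid
            else:
                bits.append("0"); lon_hi = mid
        else:           # odd positions refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append("1"); lat_lo = mid
            else:
                bits.append("0"); lat_hi = mid
    return "".join(bits)
```

Nearby locations share long bit prefixes, which is what lets the network associate regional appearance styles with the code.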

  • Book Chapter
  • Citations: 39
  • 10.1007/978-3-031-37703-7_19
NNV 2.0: The Neural Network Verification Tool
  • Jan 1, 2023
  • Diego Manzanas Lopez + 3 more

This manuscript presents the updated version of the Neural Network Verification (NNV) tool. NNV is a formal verification software tool for deep learning models and cyber-physical systems with neural network components. NNV was first introduced as a verification framework for feedforward and convolutional neural networks, as well as for neural network control systems. Since then, numerous works have made significant improvements in the verification of new deep learning models, as well as tackling some of the scalability issues that may arise when verifying complex models. In this new version of NNV, we introduce verification support for multiple deep learning models, including neural ordinary differential equations, semantic segmentation networks and recurrent neural networks, as well as a collection of reachability methods that aim to reduce the computation cost of reachability analysis of complex neural networks. We have also added direct support for standard input verification formats in the community such as VNNLIB (verification properties), and ONNX (neural networks) formats. We present a collection of experiments in which NNV verifies safety and robustness properties of feedforward, convolutional, semantic segmentation and recurrent neural networks, as well as neural ordinary differential equations and neural network control systems. Furthermore, we demonstrate the capabilities of NNV against a commercially available product in a collection of benchmarks from control systems, semantic segmentation, image classification, and time-series data.

  • Conference Article
  • Citations: 3
  • 10.1109/igarss47720.2021.9553751
A Novel Deep Transfer Learning Method for SAR and Optical Fusion Imagery Semantic Segmentation
  • Jul 11, 2021
  • Yanjuan Liu + 1 more

Synthetic Aperture Radar (SAR) imagery has been one of the important tools supporting earth observation and topographic measurement. SAR imagery is rich in structure, yet some important target categories are difficult to recognize in it. Optical imagery, by contrast, contains rich and clear spectral information, which benefits semantic image segmentation. The success of deep neural networks for semantic segmentation depends heavily on large-scale, well-labeled datasets, which are hard to collect in practice. In this paper, we consider deep transfer learning for semantic segmentation and propose a novel deep transfer learning method that transfers a semantic segmentation model from SAR imagery to SAR and optical fusion imagery. The experimental results show that the proposed method achieves higher mean Intersection over Union (mIoU) with less training time compared with other methods.

  • Research Article
  • Citations: 90
  • 10.1109/jstars.2021.3140101
A Lightweight Complex-Valued DeepLabv3+ for Semantic Segmentation of PolSAR Image
  • Jan 1, 2022
  • IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
  • Lingjuan Yu + 6 more

Semantic image segmentation is a kind of end-to-end segmentation method that classifies the target region pixel by pixel. As a classic semantic segmentation network for optical images, DeepLabv3+ can achieve good segmentation performance. However, when this network is used directly for semantic segmentation of polarimetric synthetic aperture radar (PolSAR) images, it is hard to obtain ideal segmentation results, because overfitting easily occurs on small PolSAR datasets. In this article, a lightweight complex-valued DeepLabv3+ (L-CV-DeepLabv3+) is proposed for semantic segmentation of PolSAR data. It has two significant advantages over the original DeepLabv3+. First, the proposed network, with its simplified structure and fewer parameters, is suitable for small PolSAR datasets and thus effectively avoids overfitting. Second, the proposed complex-valued (CV) network can make full use of both the amplitude and phase information of PolSAR data, which brings better segmentation performance than a real-valued (RV) network, and the related CV operations are strictly valid in the mathematical sense. Experimental results on two Flevoland datasets and one San Francisco dataset show that the proposed network obtains better overall average, mean intersection over union, and mean pixel accuracy than the original DeepLabv3+ and some other RV semantic segmentation networks.
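The complex-valued operations underlying such a CV network reduce to real ones: a complex convolution expands into four real convolutions via (a + ib)(w + iv) = (a*w - b*v) + i*(a*v + b*w), which jointly processes amplitude and phase. A minimal single-channel "valid" sketch of that expansion (ours, not the L-CV-DeepLabv3+ code):

```python
import numpy as np

def complex_conv2d_valid(x, k):
    """'Valid' 2-D cross-correlation for complex input x and kernel k,
    built from four real correlations of the real/imaginary parts."""
    def real_corr(a, w):
        H, W = a.shape
        kh, kw = w.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(a[i:i + kh, j:j + kw] * w)
        return out
    a, b = x.real, x.imag          # input real/imaginary parts
    w, v = k.real, k.imag          # kernel real/imaginary parts
    return (real_corr(a, w) - real_corr(b, v)) \
        + 1j * (real_corr(a, v) + real_corr(b, w))
```

Because the output phase depends on both input parts, a CV layer retains the PolSAR phase information that an RV layer applied to amplitudes alone would discard.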

  • Research Article
  • Citations: 391
  • 10.1109/tits.2022.3228042
Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes
  • Mar 1, 2023
  • IEEE Transactions on Intelligent Transportation Systems
  • Huihui Pan + 3 more

Using light-weight architectures or reasoning on low-resolution images, recent methods achieve very fast scene parsing, even running at more than 100 FPS on a single GPU. However, there is still a significant performance gap between these real-time methods and models based on dilation backbones. To this end, we propose a family of deep dual-resolution networks (DDRNets) for real-time and accurate semantic segmentation, consisting of deep dual-resolution backbones and enhanced low-resolution contextual information extractors. The two deep branches and multiple bilateral fusions of the backbones generate higher-quality details than existing two-pathway methods. The enhanced contextual information extractor, named Deep Aggregation Pyramid Pooling Module (DAPPM), enlarges effective receptive fields and fuses multi-scale context from low-resolution feature maps at little time cost. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both the Cityscapes and CamVid datasets. For full-resolution input, on a single 2080Ti GPU without hardware acceleration, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on the Cityscapes test set and 74.7% mIoU at 230 FPS on the CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models and requires much less computation. Code and trained models are available at https://github.com/ydhongHIT/DDRNet.
