Abstract

This thesis addresses the problem of improving text spotting systems, which aim to detect and recognize text in unrestricted images (e.g., a street sign, an advertisement, or a bus destination). The goal is to improve the performance of off-the-shelf vision systems by exploiting the semantic information derived from the image itself. The rationale is that knowing the content of the image, or the visual context, can help decide which candidate words are correct. For example, the fact that an image shows a coffee shop makes it more likely that a word on a signboard reads "Dunkin" rather than "unkind". We address this problem by drawing on successful developments in natural language processing and machine learning, in particular learning to re-rank and neural networks, to present post-processing frameworks that improve state-of-the-art text spotting systems without the need for costly data-driven re-training or tuning procedures.

Discovering the degree of semantic relatedness between candidate words and their image context is a task related to assessing the semantic similarity between words or text fragments. However, semantic relatedness is more general than similarity (e.g., car, road, and traffic light are related but not similar) and requires certain adaptations. To accommodate this broader notion, we develop two approaches to learning the semantic relatedness between the spotted word and its surrounding context: word-to-word (object) and word-to-sentence (caption). In the word-to-word approach, word-embedding-based re-rankers are developed: the re-ranker takes the candidate words from the text spotting baseline and re-ranks them according to the visual context provided by an object classifier. In the word-to-sentence approach, an end-to-end neural model is designed that exploits the image description (caption) at the sentence level as well as the word level (objects), re-ranking the candidate words based not only on the visual context but also on the co-occurrence between the words and that context.

As an additional contribution, to meet the requirements of data-driven approaches such as neural networks, we propose a visual context dataset for this task, in which the publicly available COCO-text dataset has been extended with information about the scene (including the objects and places appearing in the image). This enables researchers to include the semantic relations between text and scene in their text spotting systems, and offers a common evaluation baseline for such approaches.

Awarded by: Universitat Politècnica de Catalunya, Barcelona, Spain, on 10 September 2020.
Supervised by: Lluís Padró and Francesc Moreno-Noguer.
Available at: https://upcommons.upc.edu/handle/2117/334952.
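To make the word-to-word re-ranking idea concrete, the sketch below re-scores candidate words from a text spotting baseline by their embedding-space relatedness to an object label predicted for the image. The embedding model, the helper names, and the fixed interpolation weight are illustrative assumptions for this sketch, not the exact formulation used in the thesis.

```python
# Minimal sketch of word-to-word re-ranking: candidate words from a text
# spotting baseline are re-scored by their semantic relatedness to the
# visual context (an object label from an image classifier).
# Assumptions: GloVe vectors via gensim and a fixed 0.5/0.5 weighting.

import gensim.downloader as api

# Pre-trained word embeddings (any word-vector model would do).
word_vectors = api.load("glove-wiki-gigaword-100")

def relatedness(word: str, context: str) -> float:
    """Cosine similarity between a candidate word and a visual-context label."""
    if word in word_vectors and context in word_vectors:
        return float(word_vectors.similarity(word, context))
    return 0.0

def rerank(candidates, visual_context, alpha=0.5):
    """Combine the baseline recognition score with semantic relatedness.

    candidates: list of (word, baseline_score) pairs from the text spotter.
    visual_context: object label predicted for the image (e.g. "cafe").
    alpha: interpolation weight (fixed here; a learned re-ranker would fit it).
    """
    rescored = [
        (word, alpha * score + (1 - alpha) * relatedness(word, visual_context))
        for word, score in candidates
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Example: the visual context "cafe" should favour "dunkin" over "unkind".
print(rerank([("unkind", 0.55), ("dunkin", 0.50)], "cafe"))
```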
