Abstract

Although considerable progress has been made in recognizing multi-character text from images, robust, computationally efficient methods that can run on portable devices to read device displays in the wild are still lacking. We specifically address the problem of parsing digits from seven-segment displays. Recognizing these displays is important for many applications, such as augmented reality agents that assist users and need to verify their actions, or connecting legacy devices to the internet for process control using cheap cameras. Legacy techniques based on image processing operators and OCR are brittle, whereas massive deep networks are too computationally expensive. We describe a computationally tractable VGG-style backbone combined with a novel digit inference head that can be trained using a synthetic display generator with novel augmentations. We show that the model trained on augmented synthetic data generalizes well to a corpus of real-world display images, achieving 97.8% single-frame accuracy at a throughput of 30 frames per second. We also describe how the output can be further stabilized to improve accuracy through a form of mode filtering.
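The abstract does not specify how the mode filtering is performed. As a rough illustration only, one plausible reading is a sliding-window majority vote over consecutive per-frame readings, so that an isolated misread frame is outvoted by its neighbors. The function name `mode_filter`, the `window` parameter, and the tie-breaking rule below are all assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter, deque
from typing import Deque, Iterable, List


def mode_filter(readings: Iterable[str], window: int = 5) -> List[str]:
    """Replace each per-frame reading with the most common reading
    in a trailing window of the last `window` frames (hypothetical helper)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[str] = deque(maxlen=window)  # oldest frames fall off automatically
    smoothed: List[str] = []
    for reading in readings:
        recent.append(reading)
        counts = Counter(recent)
        best = max(counts.values())
        # Assumed tie-break: prefer the most recent reading among the tied values.
        for candidate in reversed(recent):
            if counts[candidate] == best:
                smoothed.append(candidate)
                break
    return smoothed


if __name__ == "__main__":
    # One frame is misread as "18.5"; its neighbors outvote it.
    frames = ["12.5", "12.5", "18.5", "12.5", "12.5"]
    print(mode_filter(frames, window=3))
```

A trailing window keeps the filter causal, which suits a live camera feed, at the cost of a short lag when the displayed value genuinely changes. A larger `window` suppresses more misreads but lengthens that lag.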
