Abstract

An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application adaptable fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification are increased by 11.52 and 10.65 %, compared with our previously proposed formula identification approach.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call