The IBEM dataset: A large printed scientific image dataset for indexing and searching mathematical expressions

Dan Anitei, Joan Andreu Sánchez, José Miguel Benedí, Ernesto Noya

doi:10.1016/j.patrec.2023.05.033


Open Access

https://doi.org/10.1016/j.patrec.2023.05.033


Abstract

Searching for information in printed scientific documents is a challenging problem that has recently received special attention from the Pattern Recognition research community. Mathematical expressions are complex elements that appear in scientific documents, and developing techniques for locating and recognizing them requires datasets that can serve as benchmarks. Most current techniques for dealing with mathematical expressions are based on Machine Learning, which requires a large amount of annotated data. These datasets must be prepared with ground-truth information for automatic training and testing. However, preparing large datasets with ground truth is very expensive and time-consuming. This paper introduces the IBEM dataset, consisting of scientific documents that have been prepared for mathematical expression recognition and searching. The dataset consists of 600 documents, comprising more than 8,200 page images with more than 160,000 mathematical expressions. It has been automatically generated from the LaTeX version of the documents and can be enlarged easily. The ground truth includes the position at the page level and the LaTeX transcript for mathematical expressions, both those embedded in the text and those displayed. This paper also reports a baseline classification experiment with mathematical symbols and a baseline Mathematical Expression Recognition experiment performed on the IBEM dataset. These experiments provide benchmarks for comparison so that future users of the IBEM dataset have a baseline framework.
