Subject of research. This article examines data preparation for training neural network models that recognize complex handwritten mathematical expressions and convert them into a typesetting markup such as LaTeX or MathML. The dataset under study is CROHME 2019, which contains images of handwritten mathematical expressions and their corresponding LaTeX annotations. The objective is to develop and describe a data preparation process that supports correct model training and minimizes errors in numerical labels and data transformations. The study covers the following key stages: loading and preprocessing images, parsing LaTeX annotations, linking each image with its annotation, normalizing pixel values, building and verifying the tokenizer vocabulary, and preparing the tokenizer itself. The methods consist of data analysis and processing with specialized tools for images and text annotations. The main outcome is a dataset ready for training a neural network model, with accurate alignment between images and annotations and correct conversion of mathematical expressions into LaTeX code.
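The stages listed in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the special tokens, the tokenization rule, the function names, and the assumption of 8-bit grayscale pixels are all hypothetical choices made for illustration, and nothing here depends on the actual CROHME 2019 file layout.

```python
import re
from collections import Counter

# Hypothetical special tokens; the article does not list its own.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

# Assumed tokenization rule: a LaTeX command (\frac), an escaped
# symbol (\{), or any single non-whitespace character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(latex):
    """Split a LaTeX annotation into a list of tokens."""
    return TOKEN_RE.findall(latex)


def build_vocab(annotations):
    """Map every token seen in the annotations to an integer id."""
    counts = Counter(tok for text in annotations for tok in tokenize(text))
    tokens = SPECIALS + sorted(counts)
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(latex, vocab):
    """Convert an annotation to ids, wrapped in <sos> ... <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(latex)]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


def decode(ids, vocab):
    """Convert ids back to tokens, dropping the special tokens."""
    inverse = {idx: tok for tok, idx in vocab.items()}
    return [inverse[i] for i in ids if inverse[i] not in SPECIALS]


def normalize_pixels(image):
    """Scale 8-bit pixel values (0..255) to the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def link(images, annotations):
    """Pair images and annotations by shared sample id.

    Samples present in only one of the two mappings are skipped.
    """
    shared = sorted(images.keys() & annotations.keys())
    return [(key, images[key], annotations[key]) for key in shared]
```

A simple vocabulary check in this sketch is a round trip: for any annotation used to build the vocabulary, `decode(encode(text, vocab), vocab)` should equal `tokenize(text)`. A mismatch would point to a token missing from the vocabulary.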