Implementation of an Algorithm for Extracting Information about Structural Elements of Text Documents in ODT Format

A V Berezhkov, V V Tereshchenko, V I Martsinkevich, G S Larionova

doi:10.17587/it.29.307-315

Abstract

This article examines how the XML markup of a digital document in the ODT format depends on the tool used to create it. The comparison covers not only specialized tools but also tools that do not work with the ODT format directly, in order to identify the most vulnerable spots. The specifics of extracting data from structural elements of the document, such as tables, lists and images, are also described. An algorithm for obtaining style attributes is proposed, and its implementation, used to build a system for automated normative control of digital documents, is described. The analysis shows that the non-strict standard of the ODT format makes the XML markup dependent on the text editor that produced the document and, as a result, leaves only a limited number of tags that can be relied upon when developing document parsing algorithms. Nevertheless, the task is feasible, as the article demonstrates. The default values and the described algorithm for traversing the document by blocks and structural elements form the basis for preparing data for the subsequent creation of a classifier and for automating normative control. Thus, the proposed algorithm and the XML markup analysis are effective tools for building an automated document standard control system, and the algorithm has potential for further improvement.
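
To make the idea of traversing a document by blocks concrete, the following is a minimal sketch, not the authors' implementation. It assumes only what the OpenDocument format guarantees: an ODT file is a ZIP archive whose `content.xml` holds the body under `office:body/office:text`. The function name `walk_blocks` and the `BLOCK_TAGS` mapping are hypothetical, chosen for illustration, and the sketch relies on a deliberately small set of tags (`text:h`, `text:p`, `text:list`, `table:table`, `draw:image`).

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
}

# Hypothetical mapping (an assumption of this sketch) from a qualified
# tag name to a human-readable block label.
BLOCK_TAGS = {
    f"{{{NS['text']}}}h": "heading",
    f"{{{NS['text']}}}p": "paragraph",
    f"{{{NS['text']}}}list": "list",
    f"{{{NS['table']}}}table": "table",
}
IMAGE_TAG = f"{{{NS['draw']}}}image"
TEXT_STYLE = f"{{{NS['text']}}}style-name"
TABLE_STYLE = f"{{{NS['table']}}}style-name"


def walk_blocks(odt_file):
    """Yield (kind, style_name, image_count) for each top-level block.

    `odt_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    body = root.find("office:body/office:text", NS)
    if body is None:
        return

    for element in body:
        kind = BLOCK_TAGS.get(element.tag)
        if kind is None:
            continue  # skip elements this sketch does not classify
        # Text blocks and tables store the style reference in
        # different namespaces, so check both attributes.
        style = element.get(TEXT_STYLE) or element.get(TABLE_STYLE)
        # Count images nested anywhere inside the block.
        images = sum(1 for _ in element.iter(IMAGE_TAG))
        yield kind, style, images
```

Calling `list(walk_blocks("report.odt"))` (the file name is a placeholder) would give one tuple per top-level block. The style name returned here is only a reference; resolving it to concrete attributes would additionally require reading the automatic styles in `content.xml` and the named styles in `styles.xml`, which this sketch leaves out.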
