Abstract

Spreadsheet tables are one of the most commonly used formats to organise and store sets of statistical, financial, accounting and other types of data. This form of data representation is widely used in science, education, engineering, and business. The key feature of spreadsheet tables that they are generally created by people in order to be further used by other people rather than by automated programs. During spreadsheet creation, commonly, no consideration is given to the possibility of further automated data processing. This leads to a large variety of possible spreadsheet table structures and further complicates automated extraction of table content and table understanding. One of the key factors that influence on the quality of table understanding by machines is the correctness of the header structure, for example, position and relation between cells. In this paper, we present a case study of a tabular data extraction approach and estimate its performance on a variety of datasets. The rule-driven software platform TabbyXL was used for tabular data extraction and canonicalisation. The experiment was conducted on real-world tables of SAUS200 (The 2010 Statistical Abstract of the United States) corpora. For the evaluation, we used spreadsheet tables as they are presented in SAUS; the same tables, but with an automatically corrected header structure; and tables where the structure of the header was corrected by experts. The case study results demonstrate the importance of header structure correctness for automated table processing and understanding. The ground-truth preparation procedures, example of rules describing relationships between table elements, and results of the evaluation are presented in the paper.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.