Abstract

Spreadsheets are a popular way to represent and structure data and knowledge; as a result, the semantic interpretation of spreadsheet data has become an active area of scientific research. In this paper, we propose a new approach to the semantic interpretation of data extracted from spreadsheets with arbitrary layouts and styles. The analyzed spreadsheets are in the MS Excel format. Our approach comprises two stages: (1) analyzing source spreadsheets and transforming them into a relational canonicalized form; (2) annotating the canonical spreadsheets with entities from a knowledge graph. At the first stage, we use a rule-based approach implemented as a domain-specific language called Cells Rule Language (CRL), together with an original form of a canonical table. At the second stage, we use an aggregated method for measuring similarity between candidate entities and cell values; it applies five metrics sequentially and combines the ranks obtained by each metric. The algorithms of the two stages are implemented in dedicated software: TabbyXL and TabbyLD, respectively. DBpedia is used as the knowledge graph. Experimental evaluations on the T2Dv2 and Troy200 corpora demonstrate the applicability of our approach and software to semantic spreadsheet data interpretation. A distinguishing feature of the approach is its universality, which stems from the language for describing spreadsheet transformation rules and from the original canonical form. This feature makes it possible to process large volumes of heterogeneous spreadsheets in various domains. This work is part of the Tabby research project on software for the recognition, extraction, transformation, and interpretation of data from spreadsheet tables with arbitrary layouts and styles.

