Parameterized Mapping Distances for Semi-Structured Data

Kilho Shin, Taro Niiyama

doi:10.1007/978-3-030-05453-3_21

Abstract

Edit distances are widely used to analyze the similarity of semi-structured data such as strings, trees and graphs. For example, the Levenshtein distance for strings is known to be effective for analyzing DNA and proteins, and the Tai distance and its variations are attracting wide attention from researchers who study tree-type data such as glycans, HTML DOM trees and parse trees in natural language processing. The problem we address is that new edit distances have so far been engineered in an ad hoc way, without a unified view. To solve this problem, we introduce the concept of the mapping distance and a hyper-parameter that controls the cost of label mismatches. One of the most important advantages of our parameterized mapping distances is that they can be defined for arbitrary finite sets in a consistent manner, and important properties such as satisfaction of the metric axioms can be discussed abstractly, regardless of the structure of the data. The second important advantage is that mapping distances themselves can be parameterized, and therefore we can identify the best distance for a particular application by parameter search. The mapping distance framework provides a unified view of various distance measures for semi-structured data by focusing on partial one-to-one mappings between data. These partial one-to-one mappings generalize what are known as the mappings of edit paths in the legacy study of edit distances. This is in clear contrast to the legacy edit distance framework, which defines distances through edit operations and edit paths. Our framework enables us to design new distance measures in a consistent manner and to describe various distance measures using a small number of parameters. In this paper, we take ordered rooted trees as an example and introduce three independent dimensions along which to parameterize mapping distance measures.
Through intensive experiments using ten datasets, we identify two important mapping distances that exhibit good classification performance when used with the k-NN classifier. These mapping distances are novel and have not been discussed in the literature.
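
As a rough illustration of the idea of a distance defined through partial one-to-one mappings with a tunable label-mismatch cost, the following Python sketch treats the simplest case, strings. It is not the definition used in the paper, whose formal construction for ordered rooted trees is not reproduced in this abstract. The function name `mapping_distance` and the parameters `mismatch_cost` and `gap_cost` are hypothetical, chosen only for this example.

```python
def mapping_distance(x, y, mismatch_cost=1.0, gap_cost=1.0):
    """Illustrative string distance (an assumption, not the paper's definition).

    Computes the minimum cost over order-preserving partial one-to-one
    mappings between positions of x and positions of y:
      - a mapped pair costs 0 if the labels agree, mismatch_cost otherwise;
      - every unmapped position of x or y costs gap_cost.
    With mismatch_cost = gap_cost = 1 this coincides with the Levenshtein distance.
    """
    n, m = len(x), len(y)
    # d[i][j] = minimum cost between the prefixes x[:i] and y[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap_cost  # all of x[:i] left unmapped
    for j in range(1, m + 1):
        d[0][j] = j * gap_cost  # all of y[:j] left unmapped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 0.0 if x[i - 1] == y[j - 1] else mismatch_cost
            d[i][j] = min(
                d[i - 1][j - 1] + pair,   # map x[i-1] to y[j-1]
                d[i - 1][j] + gap_cost,   # leave x[i-1] unmapped
                d[i][j - 1] + gap_cost,   # leave y[j-1] unmapped
            )
    return d[n][m]
```

With the default parameters, `mapping_distance("kitten", "sitting")` gives the familiar Levenshtein value of 3. Raising `mismatch_cost` to 2 makes mapping two different labels no cheaper than leaving both unmapped, and the value becomes 5. Varying such a parameter and scoring each setting with a classifier is one way to picture the parameter search the abstract mentions.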

Full Text