Inference of a Concise Regular Expression Considering Interleaving from XML Documents

Xiaolan Zhang, Fanlin Cui, Chunmei Dong, Yeting Li, Haiming Chen

doi:10.1007/978-3-319-93037-4_31

Abstract

XML schemas are useful in various applications. However, many XML documents in practice are accompanied by no schema, or by a schema that is not valid. It is therefore essential to design efficient algorithms for schema learning. Each element in an XML schema has its content model defined by a regular expression, so schema learning can be reduced to the inference of restricted regular expressions. In this paper, we focus on learning restricted regular expressions with interleaving from a set of XML documents. We name the new subclass CHAin Regular Expression with Interleaving (ICHARE). Based on the single occurrence automaton (SOA) and the maximum independent set (MIS), we then introduce an inference algorithm, GenICHARE. The algorithm is proved to infer a descriptive ICHARE from a given set of samples. Finally, using a data set crawled from the Web, we compare the coverage proportion of ICHARE with that of other existing subclasses. We also analyze the conciseness of the regular expressions inferred by GenICHARE on DBLP. Experimental results show that ICHARE is more concise and useful in practice, and that the inference algorithm is promising and effective.
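As a rough illustration of one building block named above, the following sketch builds a single occurrence automaton in the usual 2-gram manner: each element name becomes one state, and an edge is added for every pair of adjacent names seen in a sample. This is not GenICHARE itself, and it does not cover the maximum independent set step. The function names `build_soa` and `accepts`, the `<src>`/`<snk>` markers, and the sample child sequences are all hypothetical and chosen only for this example.

```python
# Minimal SOA sketch (assumption: standard 2-gram construction, not the
# authors' implementation). States are element names plus a source and sink.
SRC, SNK = "<src>", "<snk>"


def build_soa(samples):
    """Return the SOA edge set: every adjacent pair seen in any sample."""
    edges = set()
    for word in samples:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


def accepts(edges, word):
    """A word is accepted if each adjacent pair on its path is an SOA edge."""
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))


# Hypothetical child sequences of one element, taken from three documents.
samples = [["a", "b", "c"], ["b", "a", "c"], ["a", "c"]]
soa = build_soa(samples)

print(accepts(soa, ["a", "b", "c"]))       # True: a sample itself
print(accepts(soa, ["c", "a"]))            # False: no edge from source to c
print(accepts(soa, ["a", "b", "a", "c"]))  # True: the SOA generalizes
```

In this sample, `a` and `b` occur in both orders, so the SOA contains the edges `(a, b)` and `(b, a)` and also accepts the unseen word `a b a c`. This is the kind of situation in which an expression with an interleaving operator may describe the sample more tightly than an expression without one.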

Full Text