Abstract

Abstract The presence of a schema for eXtensible Markup Language (XML) documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a (valid) schema. Therefore, it is essential to devise algorithms to infer schemas from XML documents, where the fundamental task is learning regular expressions. In this paper, we focus on the learning of disjunctive multiplicity expressions (DMEs), a subclass of regular expressions that are particularly suitable to specify unordered models and have been used as the foundation of the schemas for unordered XML. Previous work for learning DME lacks inference algorithms that support positive and negative examples. Further, presently there has been no algorithm can learn DMEs extended with numeric occurrences. We address these challenges in the present paper and first propose a novel algorithm to learn DMEs from positive and negative examples by using genetic algorithms and parallel techniques. Then we extend DMEs to disjunctive generalized multiplicity expressions (DGMEs), which allow numeric occurrences and develop an algorithm to learn DGMEs from positive and negative examples. Finally, experimental results show that with only positive examples, our algorithm can generate a DME with an acceptable learning time, which can accept all positive examples, and when given both positive and negative examples, we can learn DMEs or DGMEs with high accuracy.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.