XML Corpus Research Articles

We study the problem of finding, given a corpus of XML documents and its schema, an optimal generative probabilistic model, where optimality means maximizing the likelihood that the model generates that particular corpus. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model in the absence of constraints. We then study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints. In this case we study two kinds of generators. First, we consider a continuation-test generator that performs schema satisfiability tests while generating documents; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this is the case, restarts and tries again. Finally, we consider the injection of data values into the structure to obtain a full XML document, and we study different approaches for generating these values.

XML has been acknowledged as the de facto standard for data representation and exchange over the World Wide Web. Being self-describing gives XML its great flexibility and wide acceptance, but it is also the cause of its main drawback: large document size. This size means that the amount of information that has to be transmitted, processed, stored, and queried is often larger than that of other data formats. Several XML compression techniques have been introduced to deal with these problems. In this paper, we provide a complete survey of the state of the art in XML compression techniques. In addition, we present an extensive experimental study of the available implementations of these techniques. We report the behavior of nine XML compressors using a large corpus of XML documents that covers the different natures and scales of XML documents. In addition to assessing and comparing the performance characteristics of the evaluated XML compression tools, the study also tries to assess the effectiveness and practicality of using these tools in the real world. Finally, we provide guidelines and recommendations to help developers and users make an effective decision when selecting the most suitable XML compression tool for their needs.

Articles published on XML Corpus

Translating Fieldwork into Datasets: The Development of a Corpus for the Quantitative Investigation of Grammatical Phenomena in Eibela

Machine learning techniques for XML (co-)clustering by structure-constrained phrases

Optimal Probabilistic Generation of XML Documents

An Application of Topic Map-Based Ontology Generated from Wikipedia for Query Expansion

XML document information retrieval model based on four-layered Bayesian network

Evaluation of Structure-Based Retrieval Performance for XML Documents

XML compression techniques: A survey and comparison

Extractive summarisation of legal texts

SAGAXSEARCH: An XML Information Retrieval Mechanism Using Self Adaptive Genetic Algorithms

A Bayesian Framework for XML Information Retrieval: Searching and Learning with the INEX Collection
