What Constitutes Successful Format Conversion? Towards a Formalization of 'Intellectual Content'

C. M. Sperberg-McQueen

doi:10.2218/ijdc.v6i1.179

Abstract

Recent work in the semantics of markup languages may offer a way to achieve more reliable results for format conversion, or at least a way to state the goal more explicitly. In the work discussed, the meaning of markup in a document is taken as the set of things accepted as true because of the markup's presence, or equivalently, as the set of inferences licensed by the markup in the document. It is possible, in principle, to apply a general semantic description of a markup vocabulary to documents encoded using that vocabulary and to generate a set of inferences (typically rather large, but finite) as a result. An ideal format conversion translating a digital object from one vocabulary to another, then, can be characterized as one which neither adds nor drops any licensed inferences; it is possible to check this equivalence explicitly for a given conversion of a digital object, and possible in principle (although probably beyond current capabilities in practice) to prove that a given transformation will, if given valid and semantically correct input, always produce output that is semantically equivalent to its input. This approach is directly applicable to the XML formats frequently used for scientific and other data, but it is also easily generalized from SGML/XML-based markup languages to digital formats in general; at a high level, it is equally applicable to document markup, to database exchanges, and to ad hoc formats for high-volume scientific data. Some obvious complications and technical difficulties arising from this approach are discussed, as are some important implications. In most real-world format conversions, the source and target formats differ at least somewhat in their ontology, either in the level of detail they cover or in the way they carve reality into classes; it is thus desirable not only to define what a perfect format conversion looks like, but to quantify the loss or distortion of information resulting from the conversion.

Highlights

It is widely believed that preservation of digital objects over long periods will typically require repeated format conversions. In many cases, as Lesk (1992) points out, these will involve copying the digital content from one type of storage medium to another, in a permanent attempt to outrun the obsolescence of one generation after another of data carriers and their associated hardware.
Recent work in the semantics of markup languages suggests that the answer to these questions is “yes”, at least in principle; this paper describes that work and its relevance for the long-term preservation of digital objects.
A process for sentence generation will start from appropriate semantic descriptions of the vocabularies involved and use a general-purpose tool to apply those semantic descriptions to XML document instances and generate the enumerated inference sentences as a result.
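
The sentence-generation process described in the last highlight can be illustrated with a minimal sketch. It is not the tool the paper has in mind: the `SEMANTICS` table, the sentence templates, the `generate_inferences` function and the sample document are all hypothetical, and a real semantic description of a vocabulary would be far richer than a table of per-element templates.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic description of a toy vocabulary: each element
# name maps to templates for the sentences its presence licenses.
# "{id}" stands for a generated identifier, "{text}" for the content.
SEMANTICS = {
    "title": ["title({id})", "has_text({id}, {text!r})"],
    "author": ["person({id})", "has_name({id}, {text!r})"],
}


def generate_inferences(xml_source, semantics):
    """Apply a semantic description to an XML document instance and
    return the set of inference sentences it licenses."""
    root = ET.fromstring(xml_source)
    sentences = set()
    # Walk every element in document order, giving each an identifier.
    for n, elem in enumerate(root.iter(), start=1):
        ident = f"e{n}"
        text = (elem.text or "").strip()
        # Elements without an entry in the description license nothing.
        for template in semantics.get(elem.tag, []):
            sentences.add(template.format(id=ident, text=text))
    return sentences


doc = "<doc><title>Hamlet</title><author>Shakespeare</author></doc>"
inferences = generate_inferences(doc, SEMANTICS)
for sentence in sorted(inferences):
    print(sentence)
```

The point of the design is that the semantic description is data, separate from the general-purpose function that applies it, so the same function can in principle serve any vocabulary for which such a description exists.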

Summary

Introduction

It is widely (and plausibly) believed that preservation of digital objects over long periods will typically require repeated format conversions. In many cases, as Lesk (1992) points out, these will involve copying the digital content from one type of storage medium to another, in a permanent attempt to outrun the obsolescence of one generation after another of data carriers and their associated hardware.

In the simplest imaginable case, where both the source vocabulary and the target vocabulary are described in terms of the same set of primitive notions (the same sets of objects, relations and predicates), it might be possible to compare the sentences produced from the two documents: if the two documents produce different sets of inferences, the meaning has changed.
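
As a rough illustration of the comparison in the simplest case, the sketch below treats each document's licensed inferences as a set of strings and reports what a conversion dropped or added. It assumes that both vocabularies already share the same primitive notions, so that equivalent inferences are spelled identically; the function name `compare_inferences` and the sample sentences are hypothetical.

```python
def compare_inferences(source_sentences, target_sentences):
    """Compare the inference sets of a source document and its converted
    form. Assumes both sets use the same primitive objects, relations
    and predicates, so that string equality means the same inference."""
    source = set(source_sentences)
    target = set(target_sentences)
    return {
        "dropped": source - target,   # licensed before, lost in conversion
        "added": target - source,     # not licensed before, introduced
        "equivalent": source == target,
    }


# Hypothetical inference sets for a document before and after conversion.
before = {"title(d1)", "has_text(d1, 'Hamlet')", "language(d1, 'en')"}
after = {"title(d1)", "has_text(d1, 'Hamlet')"}

report = compare_inferences(before, after)
print("equivalent:", report["equivalent"])
print("dropped:", sorted(report["dropped"]))
print("added:", sorted(report["added"]))
```

Here the conversion is not meaning-preserving, because one inference about the document's language is dropped. The sizes of the `dropped` and `added` sets also give a crude first measure of information loss or distortion, although nothing in this sketch weights one inference against another.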