Archiving and Maintaining Curated Databases

Henrik Høeg Müller

doi:10.18453/rosdok_id00002203

Abstract

Curated databases represent a substantial amount of effort by a dedicated group of people to produce a definitive description of some subject area. The value of curated databases lies in the quality of the data, which has been manually collected, corrected, and annotated by human curators. Many curated databases are continuously modified, and new releases are published on the Web. Given that curated databases act as publications, archiving them becomes a necessity to enable retrieval of particular database versions. A system that archives evolving databases on the Web faces several challenges. First and foremost, the system needs to be able to efficiently maintain and query multiple snapshots of ever-growing databases. Second, the system needs to be flexible enough to account for changes to the database structure and to handle data of varying quality. Third, the system needs to be robust against local failures to allow reliable long-term preservation of archived information. Our archive management system XArch addresses the first challenge by providing the functionality to maintain, populate, and query archives of database snapshots in hierarchical format. This presentation gives an overview of our ongoing efforts to improve XArch with regard to (i) archiving evolving databases, (ii) supporting distributed archives, and (iii) using our archives and XArch as the basis of a system to create, maintain, and publish curated databases.
