Self-Extending Peer Data Management

Ralf Heese, Felix Naumann, Sven Herschel, Armin Roth

doi:10.18452/9200

Abstract

Peer data management systems (PDMS) are the natural extension of integrated information systems. Conventionally, a single integrating system manages an integrated schema, distributes queries to appropriate sources, and integrates incoming data into a common result. In contrast, a PDMS consists of a set of peers, each of which can play the role of an integrating component. A peer knows about its neighboring peers through mappings, which help to translate queries and transform data. Queries submitted to one peer are answered by data residing at that peer and by data that is reached along paths of mappings through the network of peers. The only restriction for PDMS to cover unbounded data is the need to formulate at least one mapping from some known peer to a new data source. We propose a Semantic Web-based method that overcomes this restriction, albeit at a price. As sources are dynamically and automatically included in a PDMS, three factors diminish quality: the new source itself might store data of poor quality, the mapping to the PDMS might be incorrect, and the mapping to the PDMS might be incomplete. To compensate, we propose a quality model to measure this effect, a cost model to restrict query planning to the best paths through the PDMS, and techniques to answer queries in such Web-scale PDMS efficiently.

1 An Ever-growing PDMS

The step from centralized database systems (DBMS) to distributed and then to federated database systems (FDBMS) removed the assumption that data must be located at the same site as the query. A federated database provides a global schema that represents the data it can access locally and remotely. The global schema is related to the local schemata via schema mappings, which specify how the schema of a local database maps to the global schema. The federated database accepts a query against its global schema and distributes it according to the schema mappings to the different sites where the data resides.
Those sites execute the partial queries and send results back to the requesting peer. Again, the schema mappings specify how data is to be translated to conform to the global schema. The results are further processed and combined, and finally fused into a single response to the user.

A natural extension to this paradigm is to remove the assumption that queries are asked only against a single integrating site. Peer data management systems (PDMS) are built of multiple peers, each of which provides a schema and accepts queries against that schema. Again, the peers are connected by mappings among their schemata. However, instead of forming a tree with a single root, each peer can be connected to any number of other peers. Queries against the schema of one peer can be answered using the data of the entire PDMS, as long as appropriate mappings have been formed (see Fig. 1). In general, a query
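The query answering scheme described in this section, in which a query posed at one peer is also answered by data reached along paths of mappings, can be illustrated with a small sketch. It is a toy model under stated assumptions, not the method of the paper: the `Peer` class, the `answer` function, and the representation of a mapping as a plain Python function that translates a record from a neighbor's schema are all hypothetical, and the quality and cost models are left out entirely.

```python
class Peer:
    """Hypothetical peer: local records plus mappings to neighboring peers."""

    def __init__(self, name, records):
        self.name = name
        self.records = list(records)
        # Each mapping is a pair (neighbor, translate), where translate
        # converts a record from the neighbor's schema into this peer's schema.
        self.mappings = []

    def add_mapping(self, neighbor, translate):
        self.mappings.append((neighbor, translate))


def answer(peer, visited=None):
    """Return all records reachable from `peer`, expressed in its schema.

    Mappings may form cycles, so peers already on the current traversal
    are skipped (assumption: each peer contributes at most once).
    """
    if visited is None:
        visited = set()
    visited.add(peer.name)
    results = list(peer.records)  # data residing at the peer itself
    for neighbor, translate in peer.mappings:
        if neighbor.name in visited:
            continue
        # Data from further peers is translated once per mapping on the path.
        for record in answer(neighbor, visited):
            results.append(translate(record))
    return results


# Three peers with different schemata, connected in a cycle.
p1 = Peer("p1", [{"title": "A"}])
p2 = Peer("p2", [{"name": "B"}])
p3 = Peer("p3", [{"label": "C"}])
p1.add_mapping(p2, lambda r: {"title": r["name"]})
p2.add_mapping(p3, lambda r: {"name": r["label"]})
p3.add_mapping(p1, lambda r: {"label": r["title"]})

print(answer(p1))
```

Asking `p1` returns its own record together with the records of `p2` and `p3`, all translated into the schema of `p1`, even though `p1` has no direct mapping to `p3`.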
