Abstract

Many researchers have addressed the problem of providing statistical analysis for centralized databases. Perhaps surprisingly, however, no one has yet carefully examined the architectures and issues that arise when data are distributed over many sites. This paper proposes two approaches to the architecture of a system that supports statistical analyses where the underlying raw data are distributed among different (heterogeneous) database management systems (DBMSs) on different computing facilities at different sites. The underlying local DBMSs are not required to provide any statistical capabilities, but advantage can be taken of such capabilities where they are present. The first approach uses a distributed database management system (DDBMS) to obtain the raw data from the local DBMSs and provides statistical analyses at the global level. Two system configurations are proposed for this approach. The second approach distributes as much of the statistical analysis as possible to the local sites and uses partially processed data from the sites as input to the global analysis procedures. A system has been implemented to illustrate the functionality of the first approach; its major drawback is the amount of raw data that has to be shipped around the network. In the second approach, which is currently being used as the basis of an implementation, normally only summary data are moved between sites; nodes are required simply to present statistical participation views over which global statistical queries involving a wide range of statistical analyses can be evaluated. A number of challenging and interesting design decisions on modular details must be made to permit the architecture to be refined for applications in particular domains. The paper does not give detailed solutions to the problems identified but discusses them within an architectural framework that has evolved from work on the existing implementations. The framework promises to be more pragmatic than that used for current implementations.
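
The idea behind the second approach, in which sites ship summary data rather than raw data, can be illustrated with a minimal sketch. It is not the paper's implementation: the names `LocalSummary`, `summarize`, `combine`, and `global_mean_variance` are hypothetical, and the choice of statistic (mean and sample variance, merged with the standard pairwise update for partitioned data) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class LocalSummary:
    """Hypothetical summary a site might expose instead of raw rows."""
    n: int        # number of observations at the site
    mean: float   # local mean
    m2: float     # sum of squared deviations from the local mean


def summarize(values: Iterable[float]) -> LocalSummary:
    """Runs at a local site: reduce raw values to a small summary."""
    data = list(values)
    n = len(data)
    if n == 0:
        return LocalSummary(0, 0.0, 0.0)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data)
    return LocalSummary(n, mean, m2)


def combine(a: LocalSummary, b: LocalSummary) -> LocalSummary:
    """Runs at the global level: merge two summaries without raw data."""
    n = a.n + b.n
    if n == 0:
        return LocalSummary(0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return LocalSummary(n, mean, m2)


def global_mean_variance(summaries: Iterable[LocalSummary]) -> Tuple[float, float]:
    """Global mean and sample variance computed from site summaries only."""
    total = LocalSummary(0, 0.0, 0.0)
    for s in summaries:
        total = combine(total, s)
    if total.n < 2:
        raise ValueError("need at least two observations for a sample variance")
    return total.mean, total.m2 / (total.n - 1)
```

In this sketch each site transmits three numbers regardless of how many rows it holds, which is the contrast with the first approach that the abstract draws; statistics that cannot be decomposed this way would still require more data to be moved.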
