Abstract

In some domains (e.g., molecular biology), data repositories are large, dynamic, and physically distributed. Consequently, it is neither desirable nor feasible to gather all of the data in a centralized location for analysis. Hence, there is a need for efficient distributed learning algorithms that can operate across multiple data sources without transmitting large amounts of data, and for cumulative learning algorithms that can cope with data sets that grow at a rapid rate. The problem of learning from distributed data can be summarized as follows: data is distributed across multiple sites, and the learner's task is to discover useful knowledge from all of the available data. For example, such knowledge might be expressed in the form of a decision tree or a set of rules for pattern classification. A distributed learning algorithm L_D is said to be exact with respect to the hypothesis inferred by a learning algorithm L if the hypothesis produced by L_D using the distributed data sets D_1 through D_n is the same as that obtained by L when it is given access to the complete data set D, which can be constructed (in principle) by combining the individual data sets D_1 through D_n. Our approach to distributed learning is based on a decomposition of the learning task into information extraction and hypothesis generation components. This involves identifying the information requirements of a learning algorithm and designing efficient means of providing the needed information to the hypothesis generation component while avoiding the need to transmit large amounts of data. This offers a general strategy for transforming a batch or centralized learning algorithm into an exact distributed algorithm. In this approach to distributed learning, only the information extraction component needs direct access to the distributed data sets.
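
To make the decomposition concrete, here is a minimal Python sketch (an illustration of ours, not code from the paper) using a deliberately simple learner: a majority-class classifier, whose sufficient statistics are class counts. The information extraction step computes counts locally at each site, and the hypothesis generation step runs on the aggregated counts; exactness follows because per-site counts summed together equal the counts computed over the combined data set D. The example data and all function names below are hypothetical.

from collections import Counter

# Hypothetical example: each "site" holds a list of (features, label) pairs.
SITE_DATA = [
    [((1.0, 0.2), "a"), ((0.3, 0.9), "b")],
    [((0.5, 0.5), "a"), ((0.7, 0.1), "a")],
    [((0.9, 0.4), "b")],
]

def extract_statistics(site):
    """Information extraction: compute class counts locally at one site.

    Only these counts (not the raw examples) are transmitted, so the
    communication cost is independent of the number of examples.
    """
    return Counter(label for _, label in site)

def generate_hypothesis(stats):
    """Hypothesis generation: here, a trivial majority-class classifier."""
    return stats.most_common(1)[0][0]

def learn_distributed(sites):
    """Distributed learner L_D: aggregate per-site statistics, then generate."""
    total = Counter()
    for site in sites:
        total += extract_statistics(site)
    return generate_hypothesis(total)

def learn_centralized(sites):
    """Centralized learner L run on the combined data set D = D_1 ∪ ... ∪ D_n."""
    combined = [example for site in sites for example in site]
    return generate_hypothesis(Counter(label for _, label in combined))

if __name__ == "__main__":
    # Exactness: L_D applied to the distributed sites yields the same
    # hypothesis as L applied to the combined data, because class counts
    # add across sites.
    assert learn_distributed(SITE_DATA) == learn_centralized(SITE_DATA)
    print("Exact hypothesis:", learn_distributed(SITE_DATA))

The same pattern carries over to richer learners (e.g., decision trees or naive Bayes), whose sufficient statistics are likewise counts that decompose additively across sites.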
