Abstract
We consider the problem of learning a density function from observations of an unknown underlying model in a distributed setting, where the observations are partitioned across different sites. Applying commonly used density estimation methods such as the Gaussian Mixture Model (GMM) or Kernel Density Estimation (KDE) to distributed data incurs an extensive amount of communication. A common approach to address this issue is to sample a small subset of the data, collect it at a central node, and run the density estimation algorithm there. In this paper, we follow an alternative to the sub-sampling approach by proposing the nested Log-Poly model. This model provides an accurate density estimate from a small-sized statistic of the entire data. In distributed settings, only these small-sized statistics are transferred from the client nodes to a central node, where the estimation process is then run. The proposed model can be used in different learning tasks, such as classification in supervised learning and clustering in unsupervised learning. In particular, the properties of nested Log-Poly make it well suited to one-dimensional density estimation in distributed settings. This makes Log-Poly a good choice for the naive Bayes classifier, where a one-dimensional density estimate is required for every feature conditioned on the class label. We provide a theoretical analysis of the efficiency of our model in estimating a wide range of probability density functions. Our experiments show that nested Log-Poly outperforms state-of-the-art density estimators on several synthetic datasets. We compare the accuracy and the communication load of the naive Bayes classifier using nested Log-Poly and other related density estimators on several real datasets. The experimental results show that nested Log-Poly has a lower communication load while maintaining competitive classification accuracy compared to similar methods that use the entire data.
Moreover, we present a comprehensive comparison between nested Log-Poly and validated KDE with sub-sampling, in terms of the number of communicated variables and the number of bytes transferred between the clients and the central node. Nested Log-Poly provides accuracy comparable to validated KDE with sub-sampling while communicating fewer variables. However, our method needs to compute and transmit these variables with high precision in order to accurately capture the details of the underlying distributions.
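The abstract does not spell out the nested Log-Poly statistic itself, but the key property it relies on — a fixed-size summary that clients can compute locally and that the central node can merge by simple addition — holds for any exponential-family density. As an illustrative sketch (not the paper's exact method), assume a log-polynomial density f(x) ∝ exp(Σ θ_k x^k); its sufficient statistics are the power sums Σ x^k, so each client only transmits O(degree) numbers:

```python
import numpy as np

def client_statistics(x, degree=4):
    """Per-client sufficient statistics for a log-polynomial density:
    the sample count and the power sums sum(x**k), k = 1..degree.
    (Illustrative; the nested Log-Poly statistic may differ.)"""
    return len(x), np.array([np.sum(x ** k) for k in range(1, degree + 1)])

def merge_statistics(stats):
    """Central node: combine client statistics by elementwise addition --
    the communication cost is O(degree) numbers per client, independent
    of each client's sample size."""
    n = sum(s[0] for s in stats)
    moments = sum(s[1] for s in stats)
    return n, moments

rng = np.random.default_rng(0)
sites = [rng.normal(size=1000) for _ in range(3)]        # data partitioned over 3 sites
stats = [client_statistics(x) for x in sites]            # computed locally at each site
n, moments = merge_statistics(stats)                     # tiny messages sent to the centre

# Aggregation is lossless: merged statistics equal those of the pooled data.
full_n, full_moments = client_statistics(np.concatenate(sites))
assert n == full_n and np.allclose(moments, full_moments)
```

This also illustrates the precision caveat in the abstract: the density estimate is recovered from these few transmitted numbers alone, so they must be sent at high precision.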