Abstract

We present space-efficient algorithms for performing Pearson’s chi-square goodness-of-fit test in a streaming setting. Since the chi-square test is one of the best-known and most commonly used tests in statistics, it is surprising that there has been no prior work on designing streaming algorithms for it. The test does not assume a specific distribution and has one-sample and two-sample variants. Given a stream of data, the one-sample variant tests whether the stream is drawn from a fixed distribution. The two-sample variant tests whether two data streams are drawn from the same or similar distributions. One major advantage of statistical tests over other quantities commonly measured by streaming algorithms is that they require no parameter tuning and their results are easily interpreted by data analysts. In this paper, we address how to compute the chi-square test on streams with minimal parameter configuration and assumptions. We rigorously prove that, in the continuous case, the chi-square statistic can be computed with high fidelity and an almost quadratic reduction in memory, whereas the categorical case admits only heuristic solutions. We validate the performance and accuracy of our algorithms through extensive testing on both real and synthetic data sets.
