Abstract

Choosing the sample size is a fundamental problem in statistics, and it plays an important role in data collection for big data scenarios, especially when characterizing data structure. This paper considers the problem from the perspective of message importance by treating the sampling procedure as a process of collecting message importance. To this end, we define the differential message importance measure (DMIM), a measure of message importance for continuous random variables analogous to differential entropy, and calculate the DMIM for some common distributions. Based on the DMIM, we propose a new approach to determining the required sampling number, in which the DMIM deviation is constructed to characterize the process of collecting message importance. The DMIM deviation serves as a new criterion for choosing a sample size large enough that the message importance of the sample set differs from the total message importance by no more than a specified amount. To show that the DMIM deviation can guarantee statistical performance to some extent, we transform the difference in message importance into the Kolmogorov–Smirnov statistic. Theoretical analyses and numerical results demonstrate that the new approach is distribution-free and satisfies the Glivenko–Cantelli theorem, in agreement with previous results in statistics. Moreover, we establish a connection between message importance and distribution goodness-of-fit, which supports the feasibility of analyzing data collection with message importance taken into account.
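
The abstract links the sample-size criterion to the Kolmogorov–Smirnov statistic and the Glivenko–Cantelli theorem. The sketch below illustrates only that classical background and does not implement the DMIM or the DMIM deviation, whose definitions are not given here. It computes the Kolmogorov–Smirnov statistic between an empirical CDF and a reference CDF, and shows it shrinking as the sample grows. The function names, the standard normal reference distribution, and the use of the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality as a distribution-free sample-size bound are all assumptions chosen for illustration, not the paper's method.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (reference distribution assumed for this sketch)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Kolmogorov-Smirnov statistic: sup over x of |F_n(x) - F(x)|.

    The empirical CDF F_n jumps at each sorted sample, so the supremum is
    attained just before or just after one of the jumps.
    """
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def dkw_sample_size(epsilon, alpha):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= alpha (DKW inequality).

    This is a classical distribution-free bound, used here as a stand-in;
    it is not the DMIM-deviation criterion described in the abstract.
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same samples
    for n in (10, 100, 1000, 10000):
        samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
        print(n, ks_statistic(samples, normal_cdf))
    # Sample size for which the KS statistic exceeds 0.05 with probability <= 0.05
    print(dkw_sample_size(0.05, 0.05))
```

The DKW bound depends only on the tolerance and the confidence level, not on the underlying distribution, which is the sense of "distribution-free" in this classical setting.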
