Abstract
Sampling from a large dataset is commonly used in frequent pattern (FP) mining. To give tight theoretical guarantees on the quality of the FPs obtained from a sample, current methods stabilize the supports of all patterns in the random sample, even though only the FPs matter, so they always overestimate the required sample size. We propose an algorithm called multiple-sampling-based FP mining (MSFP). MSFP first generates the set of approximate frequent items (AFI) and then uses the AFI to form the set of approximate FPs without supports (AFP*). In these steps it does not stabilize the value of any item's or pattern's support; it stabilizes only the relationship (≥ or <) between the support and the minimum support. MSFP can therefore use small samples to obtain the AFI and AFP* in succession, and can successively prune patterns that are not contained in the AFI or not in the AFP*. MSFP then applies Bayesian statistics to stabilize the support values of the AFP*'s patterns alone: because a pattern's support in the original dataset is unknown, MSFP treats it as a random variable and keeps updating its distribution with the approximations obtained from the samples taken during progressive sampling, so the error probability can be bounded more tightly. Furthermore, to reduce I/O during progressive sampling, MSFP stores a sufficiently large random sample in memory in advance. Experiments show that MSFP is reliable and efficient.
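To make the AFI step concrete, the following is a minimal illustrative sketch (not the paper's implementation): items are classified as approximately frequent by checking only whether their support estimated from a random sample is ≥ or < the minimum support, rather than estimating the support values themselves precisely. The function name and parameters are assumptions for illustration.

```python
import random
from collections import Counter

def approximate_frequent_items(transactions, sample_size, min_sup, rng=None):
    """Classify items as approximately frequent using a random sample.

    Only the relation (>= or <) between an item's sampled support and
    min_sup is used, mirroring the AFI step described in the abstract.
    This is an illustrative sketch, not the MSFP algorithm itself.
    """
    rng = rng or random.Random(0)
    # Draw a uniform random sample of transactions without replacement.
    sample = rng.sample(transactions, sample_size)
    # Count, for each item, the number of sampled transactions containing it.
    counts = Counter(item for t in sample for item in set(t))
    # Keep items whose estimated (relative) support reaches min_sup.
    return {item for item, c in counts.items() if c / sample_size >= min_sup}
```

With a small sample this classification can be wrong near the threshold; MSFP's contribution is to bound that error probability while keeping the sample small.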